diff --git a/INSTALL b/INSTALL index 3c9b54200..4914b8cfa 100644 --- a/INSTALL +++ b/INSTALL @@ -1,554 +1,557 @@ CDS Invenio INSTALLATION ======================== Revision: $Id$ About ===== This document specifies how to build, customize, and install CDS Invenio for the first time. See RELEASE-NOTES if you are upgrading from a previous CDS Invenio release. Contents ======== 0. Prerequisites 1. Quick instructions for the impatient CDS Invenio admin 2. Detailed instructions for the patient CDS Invenio admin 0. Prerequisites ================ Here is the software you need to have around before you start installing CDS Invenio: a) Unix-like operating system. The main development and production platforms for CDS Invenio at CERN are GNU/Linux distributions SLC (RHEL), Debian, and Gentoo, but we also develop on FreeBSD and Mac OS X. Basically any Unix system supporting the software listed below should do. Note that if you are using Debian "Sarge" GNU/Linux, you can install most of the below-mentioned prerequisites and recommendations by running: $ sudo apt-get install libapache2-mod-python2.3 \ apache2-mpm-prefork mysql-server-4.1 mysql-client-4.1 \ python2.3-mysqldb python2.3-4suite \ python2.3-xml python2.3-libxml2 python2.3-libxslt1 \ rxp gnuplot xpdf-utils gs-common antiword catdoc \ wv html2text ppthtml xlhtml clisp gettext You can also install the following packages: $ sudo apt-get install python2.3-psyco sbcl cmucl The last three packages are not available on all Debian "Sarge" GNU/Linux architectures (e.g. not on AMD64), but they are only recommended so you can safely continue without them. Note that you can consult CDS Invenio wiki pages at for more system-specific notes. Note that the web application server should run a Message Transfer Agent (MTA) such as Postfix so that CDS Invenio can email notification alerts or registration information to the end users, contact moderators and reviewers of submitted documents, inform administrators about various runtime system information, etc. b) MySQL server (may be on a remote machine), and MySQL client (must be available locally too). MySQL versions 4.1 or 5.0 are supported. Please set the variable "max_allowed_packet" in your "my.cnf" init file to at least 4M. You may also want to run your MySQL server natively in UTF-8 mode by setting "default-character-set=utf8" in various parts of your "my.cnf" file, such as in the "[mysql]" part and elsewhere. c) Apache 2 server, with support for loading DSO modules, and optionally with SSL support for HTTPS-secure user authentication. Tested mainly with version 2.0.43 and above. Apache 2.x is required for the mod_python module (see below). d) Python v2.3 or above: as well as the following Python modules: - (mandatory) MySQLdb (version >= 1.2.1_p2; see below) - (recommended) PyXML, for XML processing: - (recommended) PyRXP, for very fast XML MARC processing: - (recommended) libxml2-python, for XML/XLST processing: - (recommended) Gnuplot.Py, for producing graphs: - (recommended) Snowball Stemmer, for stemming: - (optional) 4suite, slower alternative to PyRXP and libxml2-python: - (optional) feedparser, for web journal creation: - (optional) Psyco, to speed up the code at places: - (optional) RDFLib, to use RDF ontologies and thesauri: - (optional) mechanize, to run regression web test suite: Note: MySQLdb version 1.2.1_p2 or higher is recommended. If you are using an older version of MySQLdb, you may get into problems with character encoding. e) mod_python Apache module. 
Tested mainly with versions 3.0BETA4 and above. mod_python 3.x is required for Apache 2. Previous versions (as well as Apache 1 ones) exhibited some problems with MySQL connectivity in our experience. f) If you want to be able to extract references from PDF fulltext files, then you need to install pdftotext version 3 at least. g) If you want to be able to search for words in the fulltext files (i.e. to have fulltext indexing) or to stamp submitted files, then you need as well to install some of the following tools: - for PDF file stamping: pdftk, pdf2ps - for PDF files: pdftotext or pstotext - for PostScript files: pstotext or ps2ascii - for MS Word files: antiword, catdoc, or wvText - for MS PowerPoint files: pptHtml and html2text - for MS Excel files: xlhtml and html2text h) If you have chosen to install fast XML MARC Python processors in the step d) above, then you have to install the parsers themselves: - (optional) RXP: - (optional) 4suite: i) (recommended) Gnuplot, the command-line driven interactive plotting program. It is used to display download and citation history graphs on the Detailed record pages on the web interface. Note that Gnuplot must be compiled with PNG output support, that is, with the GD library. Note also that Gnuplot is not required, only recommended. j) (recommended) A Common Lisp implementation, such as CLISP, SBCL or CMUCL. It is used for the web server log analysing tool and the metadata checking program. Note that any of the three implementations CLISP, SBCL, or CMUCL will do. CMUCL produces fastest machine code, but it does not support UTF-8 yet. Pick up CLISP if you don't know what to do. Note that a Common Lisp implementation is not required, only recommended. k) GNU gettext, a set of tools that makes it possible to translate the application in multiple languages. This is available by default on many systems. Note that the configure script checks whether you have all the prerequisite software installed and that it won't let you continue unless everything is in order. It also warns you if it cannot find some optional but recommended software. 1. Quick instructions for the impatient CDS Invenio admin ========================================================= 1a. Installation ---------------- $ cd /usr/local/src/ $ wget http://cdsware.cern.ch/download/cds-invenio-0.99.0.tar.gz $ wget http://cdsware.cern.ch/download/cds-invenio-0.99.0.tar.gz.md5 $ wget http://cdsware.cern.ch/download/cds-invenio-0.99.0.tar.gz.sig $ md5sum -v -c cds-invenio-0.99.0.tar.gz.md5 $ gpg --verify cds-invenio-0.99.0.tar.gz.sig cds-invenio-0.99.0.tar.gz $ tar xvfz cds-invenio-0.99.0.tar.gz $ cd cds-invenio-0.99.0 $ ./configure $ make $ make install $ make install-jsmath-plugin ## optional 1b. 
Configuration ----------------- $ emacs /opt/cds-invenio/etc/invenio.conf $ emacs /opt/cds-invenio/etc/invenio-local.conf $ /opt/cds-invenio/bin/inveniocfg --update-all $ /opt/cds-invenio/bin/inveniocfg --create-tables $ /opt/cds-invenio/bin/inveniocfg --create-apache-conf $ sudo /path/to/apache/bin/apachectl graceful $ sudo chgrp -R www-data /opt/cds-invenio $ sudo chmod -R g+r /opt/cds-invenio $ sudo chmod -R g+rw /opt/cds-invenio/var $ sudo find /opt/cds-invenio -type d -exec chmod g+rxw {} \; $ /opt/cds-invenio/bin/inveniocfg --create-demo-site $ /opt/cds-invenio/bin/inveniocfg --load-demo-records $ /opt/cds-invenio/bin/inveniocfg --run-unit-tests $ /opt/cds-invenio/bin/inveniocfg --run-regression-tests $ /opt/cds-invenio/bin/inveniocfg --remove-demo-records $ /opt/cds-invenio/bin/inveniocfg --drop-demo-site $ firefox http://your.site.com/help/admin/howto-run 2. Detailed instructions for the patient CDS Invenio admin ========================================================== 2a. Installation ---------------- The CDS Invenio uses standard GNU autoconf method to build and install its files. This means that you proceed as follows: $ cd /usr/local/src/ Change to a directory where we will configure and build the CDS Invenio. (The built files will be installed into different "target" directories later.) $ wget http://cdsware.cern.ch/download/cds-invenio-0.99.0.tar.gz $ wget http://cdsware.cern.ch/download/cds-invenio-0.99.0.tar.gz.md5 $ wget http://cdsware.cern.ch/download/cds-invenio-0.99.0.tar.gz.sig Fetch CDS Invenio source tarball from the CDS Software Consortium distribution server, together with MD5 checksum and GnuPG cryptographic signature files useful for verifying the integrity of the tarball. $ md5sum -v -c cds-invenio-0.99.0.tar.gz.md5 Verify MD5 checksum. $ gpg --verify cds-invenio-0.99.0.tar.gz.sig cds-invenio-0.99.0.tar.gz Verify GnuPG cryptographic signature. Note that you may first have to import my public key into your keyring, if you haven't done that already: $ gpg --keyserver wwwkeys.eu.pgp.net --recv-keys 0xBA5A2B67 The output of the gpg --verify command should then read: Good signature from "Tibor Simko " You can safely ignore any trusted signature certification warning that may follow after the signature has been successfully verified. $ tar xvfz cds-invenio-0.99.0.tar.gz Untar the distribution tarball. $ cd cds-invenio-0.99.0 Go to the source directory. $ ./configure Configure CDS Invenio software for building on this specific platform. You can use the following optional parameters: --prefix=/opt/cds-invenio Optionally, specify the CDS Invenio general installation directory (default is /opt/cds-invenio). It will contain command-line binaries and program libraries containing the core CDS Invenio functionality, but also store web pages, runtime log and cache information, document data files, etc. Several subdirs like `bin', `etc', `lib', or `var' will be created inside the prefix directory to this effect. Note that the prefix directory should be chosen outside of the Apache htdocs tree, since only one its subdirectory (prefix/var/www) is to be accessible directly via the Web (see below). Note that CDS Invenio won't install to any other directory but to the prefix mentioned in this configuration line. --with-python=/opt/python/bin/python2.3 Optionally, specify a path to some specific Python binary. This is useful if you have more than one Python installation on your system. 
If you don't set this option, then the first Python that will be found in your PATH will be chosen for running CDS Invenio. --with-mysql=/opt/mysql/bin/mysql Optionally, specify a path to some specific MySQL client binary. This is useful if you have more than one MySQL installation on your system. If you don't set this option, then the first MySQL client executable that will be found in your PATH will be chosen for running CDS Invenio. --with-clisp=/opt/clisp/bin/clisp Optionally, specify a path to CLISP executable. This is useful if you have more than one CLISP installation on your system. If you don't set this option, then the first executable that will be found in your PATH will be chosen for running CDS Invenio. --with-cmucl=/opt/cmucl/bin/lisp Optionally, specify a path to CMUCL executable. This is useful if you have more than one CMUCL installation on your system. If you don't set this option, then the first executable that will be found in your PATH will be chosen for running CDS Invenio. --with-sbcl=/opt/sbcl/bin/sbcl Optionally, specify a path to SBCL executable. This is useful if you have more than one SBCL installation on your system. If you don't set this option, then the first executable that will be found in your PATH will be chosen for running CDS Invenio. This configuration step is mandatory. Usually, you do this step only once. (Note that if you prefer to build CDS Invenio out of its source tree, you may run the above configure command like this: mkdir build && cd build && ../configure --prefix=... FIXME: this is not working right now as per the introduction of intbitset_setup.py.) $ make Launch the CDS Invenio build. Since many messages are printed during the build process, you may want to run it in a fast-scrolling terminal such as rxvt or in a detached screen session. During this step all the pages and scripts will be pre-created and customized based on the config you have edited in the previous step. Note that on systems such as FreeBSD or Mac OS X you have to use GNU make ("gmake") instead of "make". $ make install Install the web pages, scripts, utilities and everything needed for runtime into the respective directories, as specified earlier by the configure command. Note that if you are installing CDS Invenio for the first time, you will be asked to create a symbolic link for the "invenio" Python module from Python's site-packages directory to instruct Python where to find CDS Invenio's Python files. The process will hint you at the exact command to use based on the values you have used in the configure line. (Note also that on some operating systems you might need to create another symlink manually for lib64: $ sudo ln -s /opt/cds-invenio/lib/python/invenio \ /usr/local/lib64/python2.3/site-packages/invenio if you happen to encounter some troubles finding intbitset libraries.) $ sudo make install-jsmath-plugin ## optional This will automatically download and install in the proper place jsMath, a Javascript library to render LaTeX formulas in the client browser. Note that in order to enable the rendering you will have to set later the variable CFG_WEBSEARCH_USE_JSMATH_FOR_FORMATS in the invenio.conf to a suitable list of output format codes like in "['hd', 'hb']". 2b. Configuration ----------------- Once the basic software installation is done, we proceed to configuring your Invenio system. $ emacs /opt/cds-invenio/etc/invenio.conf $ emacs /opt/cds-invenio/etc/invenio-local.conf Customize your CDS Invenio installation. 
The 'invenio.conf' file contains the vanilla default configuration parameters of a CDS Invenio installation, as coming from the distribution. You could in principle go ahead and change the values according to your local needs. However, you can also create a file named 'invenio-local.conf' in the same directory where 'invenio.conf' lives and put there only the localizations you need to have different from the default ones. For example: $ cat /opt/cds-invenio/etc/invenio-local.conf [Invenio] - WEBURL = http://your.site.com - SWEBURL = https://your.site.com + CFG_SITE_URL = http://your.site.com + CFG_SITE_SECURE_URL = https://your.site.com + CFG_SITE_ADMIN_EMAIL = john.doe@your.site.com + CFG_SITE_SUPPORT_EMAIL = john.doe@your.site.com The Invenio system will then read both the default invenio.conf file and your customized invenio-local.conf file and it will override any default options with the ones you have set in your local file. This cascading of configuration parameters will ease you future upgrades. You should override at least the parameters from the top of invenio.conf file in order to define some very essential runtime parameters such as the visible URL of your document - server (look for WEBURL and SWEBURL), the database - credentials (look for CFG_DATABASE_*), the name of your - document server (look for CDSNAME and CDSNAMEINTL), or the - email address of the local CDS Invenio administrator (look - for SUPPORTEMAIL and ADMINEMAIL). + server (look for CFG_SITE_URL and CFG_SITE_SECURE_URL), the + database credentials (look for CFG_DATABASE_*), the name of + your document server (look for CFG_SITE_NAME and + CFG_SITE_NAME_INTL_*), or the email address of the local CDS + Invenio administrator (look for CFG_SITE_SUPPORT_EMAIL and + CFG_SITE_ADMIN_EMAIL). $ /opt/cds-invenio/bin/inveniocfg --update-all Make the rest of the Invenio system aware of your invenio.conf changes. This step is mandatory each time you edit your conf files. $ /opt/cds-invenio/bin/inveniocfg --create-tables If you are installing CDS Invenio for the first time, you have to create database tables. Note that this step checks for potential problems such as the database connection rights and may ask you to perform some more administrative steps in case it detects a problem. Notably, it may ask you to set up database access permissions, based on your configure values. If you are installing CDS Invenio for the first time, you have to create a dedicated database on your MySQL server that the CDS Invenio can use for its purposes. Please contact your MySQL administrator and ask him to execute the commands this step proposes you. At this point you should now have successfully completed the "make install" process. We continue by setting up the Apache web server. $ /opt/cds-invenio/bin/inveniocfg --create-apache-conf Running this command will generate Apache virtual host configurations matching your installation. You will be instructed to check created files (usually they are located under /opt/cds-invenio/etc/apache/) and edit your httpd.conf to put the following include statements: Include /opt/cds-invenio/etc/apache/invenio-apache-vhost.conf Include /opt/cds-invenio/etc/apache/invenio-apache-vhost-ssl.conf $ sudo /path/to/apache/bin/apachectl graceful Please ask your webserver administrator to restart the Apache server after the above "httpd.conf" changes. 
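        As a side note on the 'invenio.conf' / 'invenio-local.conf'
        cascade described earlier in this section, the overriding
        behaviour can be illustrated with Python's standard
        ConfigParser module: files read later take precedence over
        files read earlier.  This is only a minimal sketch of the
        principle, using the default /opt/cds-invenio prefix; it is
        not the exact code that inveniocfg runs internally:

          from ConfigParser import ConfigParser

          config = ConfigParser()
          # Read the distribution defaults first, then the local
          # overrides; values from the file read last win.
          config.read(['/opt/cds-invenio/etc/invenio.conf',
                       '/opt/cds-invenio/etc/invenio-local.conf'])

          # Prints your CFG_SITE_URL override if invenio-local.conf
          # defines one, otherwise the default from invenio.conf.
          print config.get('Invenio', 'CFG_SITE_URL')

        This is also why 'inveniocfg --update-all' must be re-run
        after every edit of the conf files: it propagates the merged
        values to the rest of the system.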
$ sudo chgrp -R www-data /opt/cds-invenio $ sudo chmod -R g+r /opt/cds-invenio $ sudo chmod -R g+rw /opt/cds-invenio/var $ sudo find /opt/cds-invenio -type d -exec chmod g+rxw {} \; One more superuser step, because we need to enable Apache server to read files from the installation place and to write some log information and to cache interesting entities inside the "var" subdirectory of our CDS Invenio installation directory. Here we assumed that your Apache server processes are run under "www-data" group. Change this appropriately for your system. Moreover, note that if you are using SELinux extensions (e.g. on Fedora Core 6), you may have to check and enable the write access of Apache user there too. After these admin-level tasks to be performed as root, let's now go back to finish the installation of the CDS Invenio. $ /opt/cds-invenio/bin/inveniocfg --create-demo-site This step is recommended to test your local CDS Invenio installation. It should give you our "Atlantis Institute of Science" demo installation, exactly as you see it at . $ /opt/cds-invenio/bin/inveniocfg --load-demo-records Optionally, load some demo records to be able to test indexing and searching of your local CDS Invenio demo installation. $ /opt/cds-invenio/bin/inveniocfg --run-unit-tests Optionally, you can run the unit test suite to verify the unit behaviour of your local CDS Invenio installation. Note that this command should be run only after you have installed the whole system via `make install'. $ /opt/cds-invenio/bin/inveniocfg --run-regression-tests Optionally, you can run the full regression test suite to verify the functional behaviour of your local CDS Invenio installation. Note that this command requires to have created the demo site and loaded the demo records. Note also that running the regression test suite may alter the database content with junk data, so that rebuilding the demo site is strongly recommended afterwards. $ /opt/cds-invenio/bin/inveniocfg --remove-demo-records Optionally, remove the demo records loaded in the previous step, but keeping otherwise the demo collection, submission, format, and other configurations that you may reuse and modify for your own production purposes. $ /opt/cds-invenio/bin/inveniocfg --drop-demo-site Optionally, drop also all the demo configuration so that you'll end up with a completely blank CDS Invenio system. However, you may want to find it more practical not to drop the demo site configuration but to start customizing from there. $ firefox http://your.site.com/help/admin/howto-run In order to start using your CDS Invenio installation, you can start indexing, formatting and other daemons as indicated in the "HOWTO Run" guide on the above URL. You can also use the Admin Area web interfaces to perform further runtime configurations such as the definition of data collections, document types, document formats, word indexes, etc. Good luck, and thanks for choosing CDS Invenio. - CDS Development Group diff --git a/RELEASE-NOTES b/RELEASE-NOTES index 7fdf399fa..b6ab62a51 100644 --- a/RELEASE-NOTES +++ b/RELEASE-NOTES @@ -1,181 +1,183 @@ -------------------------------------------------------------------- CDS Invenio v0.99.0 is released FIXME 20, 2008 http://cdsware.cern.ch/invenio/news.html -------------------------------------------------------------------- CDS Invenio v0.99.0 was released on FIXME 20, 2008. 
What's new: ----------- *) FIXME Download: --------- Installation notes: ------------------- Please follow the INSTALL file bundled in the distribution tarball. Upgrade notes: -------------- If you are upgrading from CDS Invenio v0.92.1, then please follow the following steps: - Launch the bibsched monitor and wait until all active bibsched tasks are finished. Then put bibsched daemon into manual mode. - Stop all submission procedures and other write operations. For example, you may want to stop Apache, edit httpd.conf to introduce a global site redirect to a temporary splash page saying that upgrade is in progress, and restart Apache. - Take a backup of your current MySQL database and your CDS Invenio installation directory (usually /opt/cds-invenio). - First of all, note that CDS Invenio v0.99.0 must use MySQL server at least 4.1 and the database must be running in UTF-8 mode. If you have been running olderon MySQL server, or you have not created your database and data in default charset UTF-8 but say in Latin-1, then you must dump and reload all your tables. In order to check which version and charset you have been using, you can run: $ echo "SELECT VERSION()" | /opt/cds-invenio/bin/dbexec $ echo "SHOW CREATE DATABASE cdsinvenio" | /opt/cds-invenio/bin/dbexec $ echo "SHOW CREATE TABLE bib10x" | /opt/cds-invenio/bin/dbexec FIXME: provide some more detailed instructions how to dump data, alter table charset, and load data back. Note that the table definition and data charset altering process may take some time. If you have to upgrade your MySQL server as well, you may want to prepare the migration on another server. - Second of all, you must upgrade some Python modules, namely your MySQLdb module as indicated in the INSTALL file. Note that this will make your current site unusable, but we have stopped Apache already anyway. If you have been using Stemmer, you must upgrade its version as indicated in the INSTALL file. The version is not backward-compatible. - Untar new sources and rerun configure with old prefix argument (your old configure line is available at /opt/cds-invenio-BACKUP/etc/build/config.nice) and install CDS Invenio (see INSTALL file, part 1a). - CDS Invenio v0.99.0 uses a new INI-style of configuration. Please see INSTALL file, part 1b on how to customize your installation. You will have to create 'invenio-local.conf' where you have to merge your old config.wml values (they are available - at /opt/cds-invenio-BACKUP/lib/wml/invenio/config.wml). + at /opt/cds-invenio-BACKUP/lib/wml/invenio/config.wml). Beware + of variable name updates, e.g. 'filedirsize' now became + 'CFG_WEBSUBMIT_FILESYSTEM_BIBDOC_GROUP_LIMIT'. - If you have previously customized your page header and footer via WML, then you should now put your page customizations in a new template skin (please see the WebStyle Admin Guide for more information). Also, if you have edited v0.92.1 CSS style sheet, you may want to merge your changes into new elements of the 0.99.0 style sheet. - If you have customized your /opt/cds-invenio/etc/ files in the last installation (for example to edit the stopwords list or configure language stemmers), then you have to restore your changes into corresponding /opt/cds-invenio/etc files. - Update your database table structure: $ make update-v0.92.1-tables If you are upgrading from previous CDSware releases such as v0.7.1, then you may need to run more update statements of this kind, as indicated in previous RELEASE-NOTES files. 
(You could also start afresh, copying your old data into a fresh v0.99.0 installation.) - If you have previously customized your page templates (e.g. webbasket_templates.py), then check for changes the new version brings: $ make check-custom-templates This script will check if your customized page templates (see the WebStyle Admin Guide) still conform to the default templates. This gives you a chance to fix your templates before running 'make install', by providing a list of incompatibilities of your custom templates with the new versions of the templates. This step is only useful if you are upgrading and if you have previously customized default page templates. - You have also to run manually several migration scripts: $ python ./modules/webaccess/lib/collection_restrictions_migration_kit.py This script will migrate restricted collections that used Apache user groups to a new firewall-like role definition language. $ python ./modules/websubmit/lib/fulltext_files_migration_kit.py This script will check and update your fulltext file storage system to the new style (e.g. unique document names). $ python ./modules/websession/lib/password_migration_kit.py This script will update your local user table in order to use encrypted passwords for more security. - CDS Invenio v0.99.0 uses new, faster word indexer. You will have to reindex all your indexes by running: $ /opt/cds-invenio/bin/bibindex -u admin -R - You have to edit your httpd.conf in order to make Apache aware of new URLs. Please run: $ /opt/cds-invenio/bin/inveniocfg --create-apache-conf and check differences with your current setup and edit as appropriate. - Restart Apache and check whether everything is alright with your system. Note that the detailed record pages (/record/10) have now a tab-style look by default. You may want to update your formats to this style or else to disable this feature. Please see the WebStyle Admin Guide for more information. - Put the bibsched daemon back into the automatic mode. You are done. Further notes and issues: ------------------------- *) Some modules of this release (e.g. mail submission system) are still experimental and not yet activated. You may have a peek at what is planned, but please do not rely on them. *) The admin-level functionality of several modules is not fully developed or documented yet. What's next: ------------ *) Improving the known issues mentioned above. Strengthening the documentation towards v1.0 release. *) Improving the record editing capabilities. *) Deploying the new URL schema for all pages (admin). - end of file - \ No newline at end of file diff --git a/config/invenio-autotools.conf.in b/config/invenio-autotools.conf.in index 83bb1ec26..6285c73a5 100644 --- a/config/invenio-autotools.conf.in +++ b/config/invenio-autotools.conf.in @@ -1,78 +1,75 @@ ## $Id$ ## This file is part of CDS Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 CERN. ## ## CDS Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## CDS Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. 
## ## You should have received a copy of the GNU General Public License ## along with CDS Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. ## DO NOT EDIT THIS FILE. ## YOU SHOULD NOT EDIT THESE VALUES. THEY WERE AUTOMATICALLY ## CALCULATED BY AUTOTOOLS DURING THE "CONFIGURE" STAGE. [Invenio] -VERSION = @VERSION@ -CFG_PATH_PHP = @PHP@ +## Invenio version: +CFG_VERSION = @VERSION@ + +## directories detected from 'configure --prefix ...' parameters: CFG_PREFIX = @prefix@ -BINDIR = @prefix@/bin -PYLIBDIR = @prefix@/lib/python -LOGDIR = @localstatedir@/log -ETCDIR = @prefix@/etc -LOCALEDIR = @prefix@/share/locale -TMPDIR = @localstatedir@/tmp -CACHEDIR = @localstatedir@/cache -WEBDIR = @localstatedir@/www +CFG_BINDIR = @prefix@/bin +CFG_PYLIBDIR = @prefix@/lib/python +CFG_LOGDIR = @localstatedir@/log +CFG_ETCDIR = @prefix@/etc +CFG_LOCALEDIR = @prefix@/share/locale +CFG_TMPDIR = @localstatedir@/tmp +CFG_CACHEDIR = @localstatedir@/cache +CFG_WEBDIR = @localstatedir@/www + +## path to interesting programs: +CFG_PATH_PHP = @PHP@ CFG_PATH_ACROREAD = @ACROREAD@ CFG_PATH_GZIP = @GZIP@ CFG_PATH_GUNZIP = @GUNZIP@ CFG_PATH_TAR = @TAR@ CFG_PATH_DISTILLER = @PS2PDF@ CFG_PATH_GFILE = @FILE@ CFG_PATH_CONVERT = @CONVERT@ CFG_PATH_PDFTOTEXT = @PDFTOTEXT@ CFG_PATH_PDFTK = @PDFTK@ CFG_PATH_PDF2PS = @PDF2PS@ CFG_PATH_PSTOTEXT = @PSTOTEXT@ CFG_PATH_PSTOASCII = @PSTOASCII@ CFG_PATH_ANTIWORD = @ANTIWORD@ CFG_PATH_CATDOC = @CATDOC@ CFG_PATH_WVTEXT = @WVTEXT@ CFG_PATH_PPTHTML = @PPTHTML@ CFG_PATH_XLHTML = @XLHTML@ CFG_PATH_HTMLTOTEXT = @HTMLTOTEXT@ ## CFG_BIBINDEX_PATH_TO_STOPWORDS_FILE -- path to the stopwords file. You ## probably don't want to change this path, although you may want to ## change the content of that file. Note that the file is used by the ## rank engine internally, so it should be given even if stopword ## removal in the indexes is not used. CFG_BIBINDEX_PATH_TO_STOPWORDS_FILE = @prefix@/etc/bibrank/stopwords.kb -## Furthemore, here are some legacy config variables used by -## WebSubmit. FIXME: clean most of them away and rename existing ones -## to fit the CFG_WEBSUBMIT_* naming schema. - -counters = @localstatedir@/data/submit/counters -storage = @localstatedir@/data/submit/storage -filedir = @localstatedir@/data/files -xmlmarc2textmarc = @prefix@/bin/xmlmarc2textmarc -bibupload = @prefix@/bin/bibupload -bibformat = @prefix@/bin/bibformat -bibwords = @prefix@/bin/bibwords -bibconvert = @prefix@/bin/bibconvert -bibconvertconf = @prefix@/etc/bibconvert/config +## helper style of variables for WebSubmit: +CFG_WEBSUBMIT_COUNTERSDIR = @localstatedir@/data/submit/counters +CFG_WEBSUBMIT_STORAGEDIR = @localstatedir@/data/submit/storage +CFG_WEBSUBMIT_FILEDIR = @localstatedir@/data/files +CFG_WEBSUBMIT_BIBCONVERTCONFIGDIR = @prefix@/etc/bibconvert/config ## - end of file - \ No newline at end of file diff --git a/config/invenio.conf b/config/invenio.conf index d3032310d..7c3b8aaa8 100644 --- a/config/invenio.conf +++ b/config/invenio.conf @@ -1,560 +1,567 @@ ## $Id$ ## This file is part of CDS Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 CERN. ## ## CDS Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. 
## ## CDS Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with CDS Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. ################################################### ## About 'invenio.conf' and 'invenio-local.conf' ## ################################################### ## The 'invenio.conf' file contains the vanilla default configuration ## parameters of a CDS Invenio installation, as coming from the ## distribution. The file should be self-explanatory. Once installed ## in its usual location (usually /opt/cds-invenio/etc), you could in ## principle go ahead and change the values according to your local ## needs. ## ## However, you can also create a file named 'invenio-local.conf' in ## the same directory where 'invenio.conf' lives and put there only ## the localizations you need to have different from the default ones. ## For example: ## ## $ cat /opt/cds-invenio/etc/invenio-local.conf ## [Invenio] -## WEBURL = http://your.site.com -## SWEBURL = https://your.site.com +## CFG_SITE_URL = http://your.site.com +## CFG_SITE_SECURE_URL = https://your.site.com +## CFG_SITE_ADMIN_EMAIL = john.doe@your.site.com +## CFG_SITE_SUPPORT_EMAIL = john.doe@your.site.com ## ## The Invenio system will then read both the default invenio.conf ## file and your customized invenio-local.conf file and it will ## override any default options with the ones you have set in your ## local file. This cascading of configuration parameters will ease ## you future upgrades. [Invenio] ################################### ## Part 1: Essential parameters ## ################################### ## This part defines essential CDS Invenio internal parameters that ## everybody should override, like the name of the server or the email ## address of the local CDS Invenio administrator. -## Specify which MySQL server to use, the name of the database to use, -## and the database access credentials. - +## CFG_DATABASE_* - specify which MySQL server to use, the name of the +## database to use, and the database access credentials. CFG_DATABASE_HOST = localhost CFG_DATABASE_NAME = cdsinvenio CFG_DATABASE_USER = cdsinvenio CFG_DATABASE_PASS = my123p$ss -## WEBURL - specify URL under which your installation will be visible. -## For example, use "http://webserver.domain.com". Do not leave +## CFG_SITE_URL - specify URL under which your installation will be +## visible. For example, use "http://your.site.com". Do not leave ## trailing slash. -WEBURL = http://localhost +CFG_SITE_URL = http://localhost -## SWEBURL - specify secure URL under which your installation secure -## pages such as login or registration will be visible. For example, -## use "https://webserver.domain.com". Do not leave trailing slash. -## If you don't plan on using HTTPS, then you may leave this empty. -SWEBURL = https://localhost +## CFG_SITE_SECURE_URL - specify secure URL under which your +## installation secure pages such as login or registration will be +## visible. For example, use "https://your.site.com". Do not leave +## trailing slash. If you don't plan on using HTTPS, then you may +## leave this empty. +CFG_SITE_SECURE_URL = https://localhost -## CDSNAME -- the visible name of your CDS Invenio installation. 
-CDSNAME = Atlantis Institute of Fictive Science +## CFG_SITE_NAME -- the visible name of your CDS Invenio installation. +CFG_SITE_NAME = Atlantis Institute of Fictive Science -## CDSNAMEINTL -- the international versions of CDSNAME in various +## CFG_SITE_NAME_INTL -- the international versions of CDSNAME in various ## languages, defined using the standard locale-like language codes. -CDSNAMEINTL_en = Atlantis Institute of Fictive Science -CDSNAMEINTL_fr = Atlantis Institut des Sciences Fictives -CDSNAMEINTL_de = Atlantis Institut der fiktiven Wissenschaft -CDSNAMEINTL_es = Atlantis Instituto de la Ciencia Fictive -CDSNAMEINTL_ca = Institut Atlantis de Ciència Fictícia -CDSNAMEINTL_pt = Instituto Atlantis de Ciência Fictícia -CDSNAMEINTL_it = Atlantis Istituto di Scienza Fittizia -CDSNAMEINTL_ru = Атлантис Институт фиктивных Наук -CDSNAMEINTL_sk = Atlantis Inštitút Fiktívnych Vied -CDSNAMEINTL_cs = Atlantis Institut Fiktivních Věd -CDSNAMEINTL_no = Atlantis Institutt for Fiktiv Vitenskap -CDSNAMEINTL_sv = Atlantis Institut för Fiktiv Vetenskap -CDSNAMEINTL_el = Ινστιτούτο Φανταστικών Επιστημών Ατλαντίδος -CDSNAMEINTL_uk = Інститут вигаданих наук в Атлантісі -CDSNAMEINTL_ja = Fictive 科学のAtlantis の協会 -CDSNAMEINTL_pl = Instytut Fikcyjnej Nauki Atlantis -CDSNAMEINTL_bg = Институт за фиктивни науки Атлантис -CDSNAMEINTL_hr = Institut Fiktivnih Znanosti Atlantis -CDSNAMEINTL_zh_CN = 阿特兰提斯虚拟科学学院 -CDSNAMEINTL_zh_TW = 阿特蘭提斯虛擬科學學院 - -## CDSLANG -- the default language of the interface: -CDSLANG = en - -## CDSLANGS -- list of all languages the user interface should be -## available in, separated by commas. The order specified below will -## be respected on the interface pages. A good default would be to -## use the alphabetical order. Currently supported languages include -## Bulgarian, Catalan, Czech, German, Greek, English, Spanish, French, -## Italian, Japanese, Norwegian, Polish, Portuguese, Russian, Slovak, Swedish, -## and Ukrainian, Chinese (China), Chinese (Taiwan), so that the current -## eventual maximum you can currently select is +CFG_SITE_NAME_INTL_en = Atlantis Institute of Fictive Science +CFG_SITE_NAME_INTL_fr = Atlantis Institut des Sciences Fictives +CFG_SITE_NAME_INTL_de = Atlantis Institut der fiktiven Wissenschaft +CFG_SITE_NAME_INTL_es = Atlantis Instituto de la Ciencia Fictive +CFG_SITE_NAME_INTL_ca = Institut Atlantis de Ciència Fictícia +CFG_SITE_NAME_INTL_pt = Instituto Atlantis de Ciência Fictícia +CFG_SITE_NAME_INTL_it = Atlantis Istituto di Scienza Fittizia +CFG_SITE_NAME_INTL_ru = Атлантис Институт фиктивных Наук +CFG_SITE_NAME_INTL_sk = Atlantis Inštitút Fiktívnych Vied +CFG_SITE_NAME_INTL_cs = Atlantis Institut Fiktivních Věd +CFG_SITE_NAME_INTL_no = Atlantis Institutt for Fiktiv Vitenskap +CFG_SITE_NAME_INTL_sv = Atlantis Institut för Fiktiv Vetenskap +CFG_SITE_NAME_INTL_el = Ινστιτούτο Φανταστικών Επιστημών Ατλαντίδος +CFG_SITE_NAME_INTL_uk = Інститут вигаданих наук в Атлантісі +CFG_SITE_NAME_INTL_ja = Fictive 科学のAtlantis の協会 +CFG_SITE_NAME_INTL_pl = Instytut Fikcyjnej Nauki Atlantis +CFG_SITE_NAME_INTL_bg = Институт за фиктивни науки Атлантис +CFG_SITE_NAME_INTL_hr = Institut Fiktivnih Znanosti Atlantis +CFG_SITE_NAME_INTL_zh_CN = 阿特兰提斯虚拟科学学院 +CFG_SITE_NAME_INTL_zh_TW = 阿特蘭提斯虛擬科學學院 + +## CFG_SITE_LANG -- the default language of the interface: +CFG_SITE_LANG = en + +## CFG_SITE_LANGS -- list of all languages the user interface should +## be available in, separated by commas. The order specified below +## will be respected on the interface pages. 
A good default would be +## to use the alphabetical order. Currently supported languages +## include Bulgarian, Catalan, Czech, German, Greek, English, Spanish, +## French, Italian, Japanese, Norwegian, Polish, Portuguese, Russian, +## Slovak, Swedish, and Ukrainian, Chinese (China), Chinese (Taiwan), +## so that the current eventual maximum you can currently select is ## "bg,ca,cs,de,el,en,es,fr,hr,it,ja,no,pl,pt,ru,sk,sv,uk,zh_CN,zh_TW". -CDSLANGS = bg,ca,cs,de,el,en,es,fr,hr,it,ja,no,pl,pt,ru,sk,sv,uk,zh_CN,zh_TW - -## ALERTENGINEEMAIL -- the email address from which the alert emails -## will appear to be send: -ALERTENGINEEMAIL = cds.alert@cdsdev.cern.ch +CFG_SITE_LANGS = bg,ca,cs,de,el,en,es,fr,hr,it,ja,no,pl,pt,ru,sk,sv,uk,zh_CN,zh_TW -## SUPPORTEMAIL -- the email address of the support team for this -## installation: -SUPPORTEMAIL = cds.support@cern.ch +## CFG_SITE_SUPPORT_EMAIL -- the email address of the support team for +## this installation: +CFG_SITE_SUPPORT_EMAIL = cds.support@cern.ch -## ADMINEMAIL -- the email address of the 'superuser' for this -## installation. Enter your email address below and login with this -## address when using CDS Invenio administration modules. You will then -## be automatically recognized as superuser of the system. -ADMINEMAIL = cds.support@cern.ch +## CFG_SITE_ADMIN_EMAIL -- the email address of the 'superuser' for +## this installation. Enter your email address below and login with +## this address when using CDS Invenio administration modules. You +## will then be automatically recognized as superuser of the system. +CFG_SITE_ADMIN_EMAIL = cds.support@cern.ch ## CFG_MAX_CACHED_QUERIES -- maximum cached queries number possible ## after reaching this number of cached queries the cache is pruned ## deleting half of the older accessed cached queries. CFG_MAX_CACHED_QUERIES = 10000 # FIXME: change name to express SQL queries ## CFG_MISCUTIL_USE_SQLALCHEMY -- whether to use SQLAlchemy.pool in ## the DB engine of CDS Invenio. It is okay to enable this flag even ## if you have not installed SQLAlchemy. Note that Invenio will loose ## some perfomance if CFG_MISCUTIL_USE_SQLALCHEMY is enabled. CFG_MISCUTIL_USE_SQLALCHEMY = False ## CFG_MISCUTIL_SMTP_HOST -- which server to use as outgoing mail server to ## send outgoing emails generated by the system, for example concerning ## submissions or email notification alerts. CFG_MISCUTIL_SMTP_HOST = localhost ## CFG_MISCUTIL_SMTP_PORT -- which port to use on the outgoing mail server ## defined in the previous step. CFG_MISCUTIL_SMTP_PORT = 25 ## CFG_APACHE_PASSWORD_FILE -- the file where Apache user credentials ## are stored. Must be an absolute pathname. If the value does not ## start by a slash, it is considered to be the filename of a file ## located under prefix/var/tmp directory. This is useful for the ## demo site testing purposes. For the production site, if you plan ## to restrict access to some collections based on the Apache user ## authentication mechanism, you should put here an absolute path to ## your Apache password file. CFG_APACHE_PASSWORD_FILE = demo-site-apache-user-passwords ## CFG_APACHE_GROUP_FILE -- the file where Apache user groups are ## defined. See the documentation of the preceding config variable. CFG_APACHE_GROUP_FILE = demo-site-apache-user-groups ## CFG_CERN_SITE -- do we want to enable CERN-specific code, like the ## one that proposes links to famous HEP sites such as Spires and KEK? ## Put "1" for "yes" and "0" for "no". 
CFG_CERN_SITE = 0 ################################ ## Part 2: Web page style ## ################################ ## The variables affecting the page style. The most important one is ## the 'template skin' you would like to use. Please refer to the ## WebStyle Admin Guide for more explanation. The other variables are ## listed here mostly for backwards compatibility purposes only. ## CFG_WEBSTYLE_TEMPLATE_SKIN -- what template skin do you want to ## use? CFG_WEBSTYLE_TEMPLATE_SKIN = default ## CFG_WEBSTYLE_CDSPAGEBOXLEFTTOP -- eventual global HTML left top box: CFG_WEBSTYLE_CDSPAGEBOXLEFTTOP = ## CFG_WEBSTYLE_CDSPAGEBOXLEFTBOTTOM -- eventual global HTML left bottom box: CFG_WEBSTYLE_CDSPAGEBOXLEFTBOTTOM = ## CFG_WEBSTYLE_CDSPAGEBOXRIGHTTOP -- eventual global HTML right top box: CFG_WEBSTYLE_CDSPAGEBOXRIGHTTOP = ## CFG_WEBSTYLE_CDSPAGEBOXRIGHTBOTTOM -- eventual global HTML right bottom box: CFG_WEBSTYLE_CDSPAGEBOXRIGHTBOTTOM = ################################## ## Part 3: WebSearch parameters ## ################################## ## This section contains some configuration parameters for WebSearch ## module. Please note that WebSearch is mostly configured on ## run-time via its WebSearch Admin web interface. The parameters ## below are the ones that you do not probably want to modify very ## often during the runtime. (Note that you may modify them ## afterwards too, though.) ## CFG_WEBSEARCH_SEARCH_CACHE_SIZE -- how many queries we want to ## cache in memory per one Apache httpd process? This cache is used ## mainly for "next/previous page" functionality, but it caches also ## "popular" user queries if more than one user happen to search for ## the same thing. Note that large numbers may lead to great memory ## consumption. We recommend a value not greater than 100. CFG_WEBSEARCH_SEARCH_CACHE_SIZE = 100 ## CFG_WEBSEARCH_FIELDS_CONVERT -- if you migrate from an older ## system, you may want to map field codes of your old system (such as ## 'ti') to CDS Invenio/MySQL ("title"). Use Python dictionary syntax ## for the translation table, e.g. {'wau':'author', 'wti':'title'}. ## Usually you don't want to do that, and you would use empty dict {}. CFG_WEBSEARCH_FIELDS_CONVERT = {} ## CFG_WEBSEARCH_SIMPLESEARCH_PATTERN_BOX_WIDTH -- width of the search ## pattern window in the simple search interface, in characters. CFG_WEBSEARCH_SIMPLESEARCH_PATTERN_BOX_WIDTH = 40 ## CFG_WEBSEARCH_ADVANCEDSEARCH_PATTERN_BOX_WIDTH -- width of the ## search pattern window in the advanced search interface, in ## characters. CFG_WEBSEARCH_ADVANCEDSEARCH_PATTERN_BOX_WIDTH = 30 ## CFG_WEBSEARCH_NB_RECORDS_TO_SORT -- how many records do we still ## want to sort? For higher numbers we print only a warning and won't ## perform any sorting other than default 'latest records first', as ## sorting would be very time consuming then. We recommend a value of ## not more than a couple of thousands. CFG_WEBSEARCH_NB_RECORDS_TO_SORT = 1000 ## CFG_WEBSEARCH_CALL_BIBFORMAT -- if a record is being displayed but ## it was not preformatted in the "HTML brief" format, do we want to ## call BibFormatting on the fly? Put "1" for "yes" and "0" for "no". ## Note that "1" will display the record exactly as if it were fully ## preformatted, but it may be slow due to on-the-fly processing; "0" ## will display a default format very fast, but it may not have all ## the fields as in the fully preformatted HTML brief format. 
Note ## also that this option is active only for old (PHP) formats; the new ## (Python) formats are called on the fly by default anyway, since ## they are much faster. When usure, please set "0" here. CFG_WEBSEARCH_CALL_BIBFORMAT = 0 ## CFG_WEBSEARCH_USE_ALEPH_SYSNOS -- do we want to make old SYSNOs ## visible rather than MySQL's record IDs? You may use this if you ## migrate from a different e-doc system, and you store your old ## system numbers into 970__a. Put "1" for "yes" and "0" for ## "no". Usually you don't want to do that, though. CFG_WEBSEARCH_USE_ALEPH_SYSNOS = 0 ## CFG_WEBSEARCH_I18N_LATEST_ADDITIONS -- Put "1" if you want the ## "Latest Additions" in the web collection pages to show ## internationalized records. Useful only if your brief BibFormat ## templates contains internationalized strings. Otherwise put "0" in ## order not to slow down the creation of latest additions by WebColl. CFG_WEBSEARCH_I18N_LATEST_ADDITIONS = 0 ## CFG_WEBSEARCH_INSTANT_BROWSE -- the number of records to display ## under 'Latest Additions' in the web collection pages. CFG_WEBSEARCH_INSTANT_BROWSE = 10 ## CFG_WEBSEARCH_INSTANT_BROWSE_RSS -- the number of records to ## display in the RSS feed. CFG_WEBSEARCH_INSTANT_BROWSE_RSS = 25 ## CFG_WEBSEARCH_RSS_TTL -- number of minutes that indicates how long ## a feed cache is valid. CFG_WEBSEARCH_RSS_TTL = 360 ## CFG_WEBSEARCH_RSS_MAX_CACHED_REQUESTS -- maximum number of request kept ## in cache. If the cache is filled, following request are not cached. CFG_WEBSEARCH_RSS_MAX_CACHED_REQUESTS = 1000 ## CFG_WEBSEARCH_AUTHOR_ET_AL_THRESHOLD -- up to how many author names ## to print explicitely; for more print "et al". Note that this is ## used in default formatting that is seldomly used, as usually ## BibFormat defines all the format. The value below is only used ## when BibFormat fails, for example. CFG_WEBSEARCH_AUTHOR_ET_AL_THRESHOLD = 3 ## CFG_WEBSEARCH_NARROW_SEARCH_SHOW_GRANDSONS -- whether to show or ## not collection grandsons in Narrow Search boxes (sons are shown by ## default, grandsons are configurable here). Use 0 for no and 1 for ## yes. CFG_WEBSEARCH_NARROW_SEARCH_SHOW_GRANDSONS = 1 ## CFG_WEBSEARCH_CREATE_SIMILARLY_NAMED_AUTHORS_LINK_BOX -- shall we ## create help links for Ellis, Nick or Ellis, Nicholas and friends ## when Ellis, N was searched for? Useful if you have one author ## stored in the database under several name formats, namely surname ## comma firstname and surname comma initial cataloging policy. Use 0 ## for no and 1 for yes. CFG_WEBSEARCH_CREATE_SIMILARLY_NAMED_AUTHORS_LINK_BOX = 1 ## CFG_WEBSEARCH_USE_JSMATH_FOR_FORMATS -- jsMath is a Javascript ## library that renders (La)TeX mathematical formulas in the client ## browser. This parameter must contain a list of output format for ## which to apply jsMath rendering, for example "['hd', 'hb']". If ## the list is empty, jsMath is disabled. CFG_WEBSEARCH_USE_JSMATH_FOR_FORMATS = [] ####################################### ## Part 4: BibHarvest OAI parameters ## ####################################### ## This part defines parameters for the CDS Invenio OAI gateway. ## Useful if you are running CDS Invenio as OAI data provider. ## CFG_OAI_ID_FIELD -- OAI identifier MARC field: CFG_OAI_ID_FIELD = 909COo ## CFG_OAI_SET_FIELD -- OAI set MARC field: CFG_OAI_SET_FIELD = 909COp ## CFG_OAI_DELETED_POLICY -- OAI deletedrecordspolicy ## (no/transient/persistent). 
CFG_OAI_DELETED_POLICY = no ## CFG_OAI_ID_PREFIX -- OAI identifier prefix: CFG_OAI_ID_PREFIX = atlantis.cern.ch ## CFG_OAI_SAMPLE_IDENTIFIER -- OAI sample identifier: CFG_OAI_SAMPLE_IDENTIFIER = oai:atlantis.cern.ch:CERN-TH-4036 ## CFG_OAI_IDENTIFY_DESCRIPTION -- description for the OAI Identify verb: CFG_OAI_IDENTIFY_DESCRIPTION = oai atlantis.cern.ch : oai:atlantis.cern.ch:CERN-TH-4036 http://atlantis.cern.ch/ Free and unlimited use by anybody with obligation to refer to original record Full content, i.e. preprints may not be harvested by robots Submission restricted. Submitted documents are subject of approval by OAI repository admins. ## CFG_OAI_LOAD -- OAI number of records in a response: CFG_OAI_LOAD = 1000 ## CFG_OAI_EXPIRE -- OAI resumptionToken expiration time: CFG_OAI_EXPIRE = 90000 ## CFG_OAI_SLEEP -- service unavailable between two consecutive ## requests for CFG_OAI_SLEEP seconds: CFG_OAI_SLEEP = 10 ################################## ## Part 5: WebSubmit parameters ## ################################## ## This section contains some configuration parameters for WebSubmit ## module. Please note that WebSubmit is mostly configured on ## run-time via its WebSubmit Admin web interface. The parameters ## below are the ones that you do not probably want to modify during ## the runtime. -## filedirsize -- all attached fulltext files are stored -## under the CFG_FILE_DIR directory, inside subdirectories called gX -## this variable indicates the maximum number of files stored in each -## subdirectories. -filedirsize = 5000 +## CFG_WEBSUBMIT_FILESYSTEM_BIBDOC_GROUP_LIMIT -- the fulltext +## documents are stored under "/opt/cds-invenio/var/data/files/gX/Y" +## directories where X is 0,1,... and Y stands for bibdoc ID. Thusly +## documents Y are grouped into directories X and this variable +## indicates the maximum number of documents Y stored in each +## directory X. This limit is imposed solely for filesystem +## performance reasons in order not to have too many subdirectories in +## a given directory. +CFG_WEBSUBMIT_FILESYSTEM_BIBDOC_GROUP_LIMIT = 5000 ################################# ## Part 6: BibIndex parameters ## ################################# ## This section contains some configuration parameters for BibIndex ## module. Please note that BibIndex is mostly configured on run-time ## via its BibIndex Admin web interface. The parameters below are the ## ones that you do not probably want to modify very often during the ## runtime. ## CFG_BIBINDEX_FULLTEXT_INDEX_LOCAL_FILES_ONLY -- when fulltext indexing, do ## you want to index locally stored files only, or also external URLs? ## Use "0" to say "no" and "1" to say "yes". CFG_BIBINDEX_FULLTEXT_INDEX_LOCAL_FILES_ONLY = 0 ## CFG_BIBINDEX_REMOVE_STOPWORDS -- when indexing, do we want to remove ## stopwords? Use "0" to say "no" and "1" to say "yes". CFG_BIBINDEX_REMOVE_STOPWORDS = 0 ## CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS -- characters considered as ## alphanumeric separators of word-blocks inside words. You probably ## don't want to change this. CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS = \!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~ # FIXME: maybe remove backslashes ## CFG_BIBINDEX_CHARS_PUNCTUATION -- characters considered as punctuation ## between word-blocks inside words. You probably don't want to ## change this. CFG_BIBINDEX_CHARS_PUNCTUATION = \.\,\:\;\?\!\"\(\)\'\`\<\> # FIXME: maybe remove backslashes ## CFG_BIBINDEX_REMOVE_HTML_MARKUP -- should we attempt to remove HTML markup ## before indexing? 
Use 1 if you have HTML markup inside metadata ## (e.g. in abstracts), use 0 otherwise. CFG_BIBINDEX_REMOVE_HTML_MARKUP = 0 ## CFG_BIBINDEX_REMOVE_LATEX_MARKUP -- should we attempt to remove LATEX markup ## before indexing? Use 1 if you have LATEX markup inside metadata ## (e.g. in abstracts), use 0 otherwise. CFG_BIBINDEX_REMOVE_LATEX_MARKUP = 0 ## CFG_BIBINDEX_MIN_WORD_LENGTH -- minimum word length allowed to be added to ## index. The terms smaller then this amount will be discarded. ## Useful to keep the database clean, however you can safely leave ## this value on 0 for up to 1,000,000 documents. CFG_BIBINDEX_MIN_WORD_LENGTH = 0 ## CFG_BIBINDEX_URLOPENER_USERNAME and CFG_BIBINDEX_URLOPENER_PASSWORD -- ## access credentials to access restricted URLs, interesting only if ## you are fulltext-indexing files located on a remote server that is ## only available via username/password. But it's probably better to ## handle this case via IP or some convention; the current scheme is ## mostly there for demo only. CFG_BIBINDEX_URLOPENER_USERNAME = mysuperuser CFG_BIBINDEX_URLOPENER_PASSWORD = mysuperpass ## CFG_INTBITSET_ENABLE_SANITY_CHECKS -- ## Enable sanity checks for integers passed to the intbitset data ## structures. It is good to enable this during debugging ## and to disable this value for speed improvements. CFG_INTBITSET_ENABLE_SANITY_CHECKS = False ####################################### ## Part 7: Access control parameters ## ####################################### ## This section contains some configuration parameters for the access ## control system. Please note that WebAccess is mostly configured on ## run-time via its WebAccess Admin web interface. The parameters ## below are the ones that you do not probably want to modify very ## often during the runtime. (If you do want to modify them during ## runtime, for example te deny access temporarily because of backups, ## you can edit access_control_config.py directly, no need to get back ## here and no need to redo the make process.) ## CFG_ACCESS_CONTROL_LEVEL_SITE -- defines how open this site is. ## Use 0 for normal operation of the site, 1 for read-only site (all ## write operations temporarily closed), 2 for site fully closed. ## Useful for site maintenance. CFG_ACCESS_CONTROL_LEVEL_SITE = 0 ## CFG_ACCESS_CONTROL_LEVEL_GUESTS -- guest users access policy. Use ## 0 to allow guest users, 1 not to allow them (all users must login). CFG_ACCESS_CONTROL_LEVEL_GUESTS = 0 ## CFG_ACCESS_CONTROL_LEVEL_ACCOUNTS -- account registration and ## activation policy. When 0, users can register and accounts are ## automatically activated. When 1, users can register but admin must ## activate the accounts. When 2, users cannot register nor update ## their email address, only admin can register accounts. When 3, ## users cannot register nor update email address nor password, only ## admin can register accounts. When 4, the same as 3 applies, nor ## user cannot change his login method. CFG_ACCESS_CONTROL_LEVEL_ACCOUNTS = 0 ## CFG_ACCESS_CONTROL_LIMIT_REGISTRATION_TO_DOMAIN -- limit account ## registration to certain email addresses? If wanted, give domain ## name below, e.g. "cern.ch". If not wanted, leave it empty. CFG_ACCESS_CONTROL_LIMIT_REGISTRATION_TO_DOMAIN = ## CFG_ACCESS_CONTROL_NOTIFY_ADMIN_ABOUT_NEW_ACCOUNTS -- send a ## notification email to the administrator when a new account is ## created? Use 0 for no, 1 for yes. 
CFG_ACCESS_CONTROL_NOTIFY_ADMIN_ABOUT_NEW_ACCOUNTS = 0 ## CFG_ACCESS_CONTROL_NOTIFY_USER_ABOUT_NEW_ACCOUNT -- send a ## notification email to the user when a new account is created in order to ## to verify the validity of the provided email address? Use ## 0 for no, 1 for yes. CFG_ACCESS_CONTROL_NOTIFY_USER_ABOUT_NEW_ACCOUNT = 1 ## CFG_ACCESS_CONTROL_NOTIFY_USER_ABOUT_ACTIVATION -- send a ## notification email to the user when a new account is activated? ## Use 0 for no, 1 for yes. CFG_ACCESS_CONTROL_NOTIFY_USER_ABOUT_ACTIVATION = 0 ## CFG_ACCESS_CONTROL_NOTIFY_USER_ABOUT_DELETION -- send a ## notification email to the user when a new account is deleted or ## account demand rejected? Use 0 for no, 1 for yes. CFG_ACCESS_CONTROL_NOTIFY_USER_ABOUT_DELETION = 0 ############################### ## FIXME: Undocumented ones: ## ############################### ## BibRank: CFG_BIBRANK_SHOW_READING_STATS = 1 CFG_BIBRANK_SHOW_DOWNLOAD_STATS = 1 CFG_BIBRANK_SHOW_DOWNLOAD_GRAPHS = 1 CFG_BIBRANK_SHOW_DOWNLOAD_GRAPHS_CLIENT_IP_DISTRIBUTION = 0 CFG_BIBRANK_SHOW_CITATION_LINKS = 1 CFG_BIBRANK_SHOW_CITATION_STATS = 1 CFG_BIBRANK_SHOW_CITATION_GRAPHS = 1 ## WebComment: CFG_WEBCOMMENT_ALLOW_COMMENTS = 1 CFG_WEBCOMMENT_ALLOW_REVIEWS = 1 CFG_WEBCOMMENT_ALLOW_SHORT_REVIEWS = 0 CFG_WEBCOMMENT_NB_REPORTS_BEFORE_SEND_EMAIL_TO_ADMIN = 5 CFG_WEBCOMMENT_NB_COMMENTS_IN_DETAILED_VIEW = 1 CFG_WEBCOMMENT_NB_REVIEWS_IN_DETAILED_VIEW = 1 CFG_WEBCOMMENT_ADMIN_NOTIFICATION_LEVEL = 1 CFG_WEBCOMMENT_TIMELIMIT_PROCESSING_COMMENTS_IN_SECONDS = 20 CFG_WEBCOMMENT_TIMELIMIT_PROCESSING_REVIEWS_IN_SECONDS = 20 # FIXME: not found in modules subdir?! CFG_WEBCOMMENT_TIMELIMIT_VOTE_VALIDITY_IN_DAYS = 365 # FIXME: not found in modules subdir?! CFG_WEBCOMMENT_TIMELIMIT_REPORT_VALIDITY_IN_DAYS = 100 - ## BibSched: CFG_BIBSCHED_REFRESHTIME = 5 # CFG_BIBSCHED_LOG_PAGER = "/bin/more" CFG_BIBSCHED_LOG_PAGER = None +## WebAlert: + +## CFG_WEBALERT_ALERT_ENGINE_EMAIL -- the email address from which the +## alert emails will appear to be send: +CFG_WEBALERT_ALERT_ENGINE_EMAIL = cds.alert@cdsdev.cern.ch + ########################## ## THAT's ALL, FOLKS! ## ########################## \ No newline at end of file diff --git a/modules/bibclassify/lib/bibclassify_daemon.py b/modules/bibclassify/lib/bibclassify_daemon.py index 05dc0a4b8..2c5798a45 100644 --- a/modules/bibclassify/lib/bibclassify_daemon.py +++ b/modules/bibclassify/lib/bibclassify_daemon.py @@ -1,175 +1,175 @@ # -*- coding: utf-8 -*- ## ## This file is part of CDS Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007 CERN. ## ## CDS Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## CDS Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with CDS Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. """ BibClassify daemon. FIXME: the code below requires collection table to be updated to add column: clsMETHOD_fk mediumint(9) unsigned NOT NULL, This is not clean and should be fixed. 
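(For illustration only, the schema change mentioned in the FIXME
above would typically amount to something along the lines of:

    ALTER TABLE collection
      ADD COLUMN clsMETHOD_fk mediumint(9) unsigned NOT NULL;

run against the CDS Invenio database, for example via the
/opt/cds-invenio/bin/dbexec helper.  This is an assumption drawn
from the note above, not an official migration step.)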
""" __revision__ = "$Id$" import sys from invenio.dbquery import run_sql from invenio.bibtask import task_init, write_message, get_datetime, \ task_set_option, task_get_option, task_get_task_param, task_update_status, \ task_update_progress from invenio.bibclassifylib import generate_keywords_rdf from invenio.config import * from os import popen, remove, listdir import sys from invenio.intbitset import intbitset from invenio.search_engine import get_collection_reclist from invenio.bibdocfile import BibRecDocs import time import os def get_recids_foreach_ontology(): """Returns an array containing hash objects containing the collection, its corresponding ontology and the records belonging to the given collection.""" rec_onts = [] res = run_sql("""SELECT clsMETHOD.name, last_updated, collection.name FROM clsMETHOD JOIN collection_clsMETHOD ON clsMETHOD.id=id_clsMETHOD JOIN collection ON id_collection=collection.id""") for ontology, date_last_run, collection in res: recs = get_collection_reclist(collection) if recs: if not date_last_run: date_last_run = '0000-00-00' modified_records = intbitset(run_sql("SELECT id FROM bibrec WHERE modification_date >=%s", (date_last_run, ))) recs &= modified_records if recs: rec_onts.append({ 'ontology' : ontology, 'collection' : collection, 'recIDs' : recs }) return rec_onts def update_date_of_last_run(): """ Update bibclassify daemon table information about last run time. """ run_sql("UPDATE clsMETHOD SET last_updated=NOW()") def task_run_core(): """Runs anayse_documents for each ontology,collection,record ids set.""" - outfilename = tmpdir + "/bibclassifyd_%s.xml" % time.strftime("%Y%m%dH%M%S", time.localtime()) + outfilename = CFG_TMPDIR + "/bibclassifyd_%s.xml" % time.strftime("%Y%m%dH%M%S", time.localtime()) outfiledesc = open(outfilename, "w") coll_counter = 0 print >> outfiledesc, """""" print >> outfiledesc, """""" for onto_rec in get_recids_foreach_ontology(): write_message('Applying taxonomy %s to collection %s (%s records)' % (onto_rec['ontology'], onto_rec['collection'], len(onto_rec['recIDs']))) if onto_rec['recIDs']: coll_counter += analyse_documents(onto_rec['recIDs'], onto_rec['ontology'], onto_rec['collection'], outfilename, outfiledesc) print >> outfiledesc, '' outfiledesc.close() if coll_counter: - cmd = "%s/bibupload -n -c '%s' " % (bindir, outfilename) + cmd = "%s/bibupload -n -c '%s' " % (CFG_BINDIR, outfilename) errcode = 0 try: errcode = os.system(cmd) except OSError, e: print 'command' + cmd + ' failed ',e if errcode != 0: write_message("WARNING, %s failed, error code is %s" % (cmd,errcode)) return 0 update_date_of_last_run() return 1 def analyse_documents(recs, ontology, collection, outfilename, outfiledesc): """For each collection, parse the documents attached to the records in collection with the corresponding ontology.""" time_now = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()) did_something = False counter = 1 max = len(recs) # store running time: # see which records we'll have to process: #recIDs = get_recIDs_of_modified_records_since_last_run() temp_text = None if recs: # process records: cmd = None path = None - temp_text = tmpdir + '/bibclassify.pdftotext.' + str(os.getpid()) + temp_text = CFG_TMPDIR + '/bibclassify.pdftotext.' 
+ str(os.getpid()) for rec in recs: bibdocfiles = BibRecDocs(rec).list_latest_files() found_one_pdf = False for bibdocfile in bibdocfiles: if bibdocfile.get_format() == '.pdf': found_one_pdf = True if found_one_pdf: did_something = True print >> outfiledesc, '' print >> outfiledesc, """%(recID)s""" % ({'recID':rec}) for f in bibdocfiles: if f.get_format() == '.pdf': cmd = "%s '%s' '%s'" % (CFG_PATH_PDFTOTEXT, f.get_full_path(), temp_text) else: write_message("Can't parse file %s." % f.get_full_path(), verbose=3) continue errcode = os.system(cmd) if errcode != 0 or not os.path.exists("%s" % temp_text): write_message("Error while executing command %s Error code was: %s " % (cmd, errcode)) write_message('Generating keywords for %s' % f.get_full_path()) - print >> outfiledesc, generate_keywords_rdf(temp_text, etcdir + '/bibclassify/' + ontology + '.rdf', 2, 70, 25, 0, False, verbose=0, ontology=ontology) + print >> outfiledesc, generate_keywords_rdf(temp_text, CFG_ETCDIR + '/bibclassify/' + ontology + '.rdf', 2, 70, 25, 0, False, verbose=0, ontology=ontology) print >> outfiledesc, '' task_update_progress("Done %s of %s for collction %s." % (counter, max, collection)) counter += 1 else: write_message("Nothing to be done, move along") return did_something def cleanup_tmp(): """Remove old temporary files created by this module""" - for f in listdir(tmpdir): - if 'bibclassify' in f: remove(tmpdir + '/' +f) + for f in listdir(CFG_TMPDIR): + if 'bibclassify' in f: remove(CFG_TMPDIR + '/' +f) def main(): """Constructs the bibclassifyd bibtask.""" cleanup_tmp() task_init(authorization_action='runbibclassify', authorization_msg="BibClassify Task Submission", description="""Examples: %s -u admin """ % (sys.argv[0],), version=__revision__, task_run_fnc = task_run_core) if __name__ == '__main__': main() # FIXME: one can have more than one ontologies in clsMETHOD. # bibclassifyd -w HEP,Pizza # FIXME: add more CLI options like bibindex ones, e.g. # bibclassifyd -a -i 10-20 # FIXME: outfiledesc can be multiple files, e.g. when processing # 100000 records it is good to store results by 1000 records # (see oaiharvest) diff --git a/modules/bibclassify/lib/bibclassifylib.py b/modules/bibclassify/lib/bibclassifylib.py index 0d3a66a7b..d94b3869c 100644 --- a/modules/bibclassify/lib/bibclassifylib.py +++ b/modules/bibclassify/lib/bibclassifylib.py @@ -1,805 +1,805 @@ # -*- coding: utf-8 -*- ## ## This file is part of CDS Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 CERN. ## ## CDS Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## CDS Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with CDS Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. """ Bibclassify keyword extractor command line entry point. 
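For illustration, a library-style call using the simple text-thesaurus extractor defined below (both paths are placeholders, not part of the original module):

    # hypothetical example: rank thesaurus keywords by occurrence in a fulltext dump
    keywords = generate_keywords("/tmp/fulltext.txt", "/path/to/thesaurus.txt")

When run standalone without invenio.config, main() falls back to the TMPDIR_STANDALONE and PDFTOTEXT_STANDALONE paths defined below.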
""" __revision__ = "$Id$" import getopt import string import os import re import sys import time import copy import shelve from invenio.bibtask import write_message # Please point the following variables to the correct paths if using standalone (Invenio-independent) version TMPDIR_STANDALONE = "/tmp" PDFTOTEXT_STANDALONE = "/usr/bin/pdftotext" fontSize = [12, 14, 16, 18, 20, 22, 24, 26, 28, 30] def usage(code, msg=''): "Prints usage for this module." if msg: sys.stderr.write("Error: %s.\n" % msg) usagetext = """ Usage: bibclassify [options] Examples: bibclassify -f file.pdf -k thesaurus.txt -o TEXT bibclassify -f file.txt -K taxonomy.rdf -l 120 -m FULL Specific options: -f, --file=FILENAME name of the file to be classified (Use '.pdf' extension for PDF files; every other extension is treated as text) -k, --thesaurus=FILENAME name of the text thesaurus (one keyword per line) -K, --taxonomy=FILENAME name of the RDF SKOS taxonomy/ontology (a local file or URL) -o, --output=HTML|TEXT|MARCXML output list of keywords in either HTML, text, or MARCXML -l, --limit=INTEGER maximum number of keywords that will be processed to generate results (the higher the l, the higher the number of possible composite keywords) -n, --nkeywords=INTEGER maximum number of single keywords that will be generated -m, --mode=FULL|PARTIAL processing mode: PARTIAL (run on abstract and selected pages), FULL (run on whole document - more accurate, but slower) -q, --spires outputs composite keywords in the SPIRES standard format (ckw1, ckw2) General options: -h, --help print this help and exit -V, --version print version and exit -v, --verbose=LEVEL Verbose level (0=min, 1=default, 9=max). """ sys.stderr.write(usagetext) sys.exit(code) def generate_keywords(textfile, dictfile, verbose=0): """ A method that generates a sorted list of keywords of a document (textfile) based on a simple thesaurus (dictfile). """ keylist = [] keyws = [] wordlista = os.popen("more " + dictfile) thesaurus = [x[:-1] for x in wordlista.readlines()] for keyword in thesaurus: try: string.atoi(keyword) except ValueError: dummy = 1 else: continue if len(keyword)<=1: #whitespace or one char - get rid of continue else: dictOUT = os.popen('grep -iwc "' +keyword.strip()+'" '+textfile).read() try: occur = int(dictOUT) if occur != 0: keylist.append([occur, keyword]) except ValueError: continue keylist.sort() keylist.reverse() for item in keylist: keyws.append(item[1]) return keyws def generate_keywords_rdf(textfile, dictfile, output, limit, nkeywords, mode, spires, verbose=0, ontology=None): """ A method that generates a sorted list of keywords (text or html output) based on a RDF thesaurus. 
""" import rdflib keylist = [] ckwlist = {} outlist = [] compositesOUT = [] compositesTOADD = [] keys2drop = [] raw = [] composites = {} compositesIDX = {} text_out = "" html_out = [] store = None reusing_compiled_ontology_p = False compiled_ontology_db = None compiled_ontology_db_file = dictfile + '.db' namespace = rdflib.Namespace("http://www.w3.org/2004/02/skos/core#") if not(os.access(dictfile,os.F_OK) and os.access(compiled_ontology_db_file,os.F_OK) and os.path.getmtime(compiled_ontology_db_file) > os.path.getmtime(dictfile)): # changed graph type, recommended by devel team store = rdflib.ConjunctiveGraph() store.parse(dictfile) compiled_ontology_db = shelve.open(compiled_ontology_db_file) compiled_ontology_db['graph'] = store if verbose >= 3: write_message("Creating compiled ontology %s for the first time" % compiled_ontology_db_file, sys.stderr) else: if verbose >= 3: write_message("Reusing compiled ontology %s" % compiled_ontology_db_file, sys.stderr) reusing_compiled_ontology_p = True compiled_ontology_db = shelve.open(compiled_ontology_db_file) store = compiled_ontology_db['graph'] size = int(os.stat(textfile).st_size) rtmp = open(textfile, 'r') atmp = open(textfile, 'r') # ASSUMPTION: Guessing that the first 10% of file contains title and abstract abstract = " " + str(atmp.read(int(size*0.1))) + " " if mode == 1: # Partial mode: analysing only abstract + title + middle portion of document # Abstract and title is generally never more than 20% of whole document. text_string = " " + str(rtmp.read(int(size*0.2))) throw_away = str(rtmp.read(int(size*0.25))) text_string += str(rtmp.read(int(size*0.2))) else: # Full mode: get all document text_string = " " + str(rtmp.read()) + " " atmp.close() rtmp.close() try: # Here we are trying to match the human-assigned keywords # These are generally found in a document after the key phrase "keywords" or similar if text_string.find("Keywords:"): safe_keys = text_string.split("Keywords:")[1].split("\n")[0] elif text_string.find("Key words:"): safe_keys = text_string.split("Key words:")[1].split("\n")[0] elif text_string.find("Key Words:"): safe_keys = text_string.split("Key Words:")[1].split("\n")[0] except: safe_keys = "" if safe_keys != "": write_message("Author keyword string detected: %s" % safe_keys, verbose=8) # Here we start the big for loop around all concepts in the RDF ontology if not reusing_compiled_ontology_p: # we have to compile ontology first: for s,pref in store.subject_objects(namespace["prefLabel"]): dictOUT = 0 safeOUT = 0 hideOUT = 0 candidates = [] wildcard = "" regex = False nostandalone = False # For each concept, we gather the candidates (i.e. prefLabel, hiddenLabel and altLabel) candidates.append(pref.strip()) # If the candidate is a ckw and it has no altLabel, we are not interested at this point, go to the next item if store.value(s,namespace["compositeOf"],default=False,any=True) and not store.value(s,namespace["altLabel"],default=False,any=True): continue if str(store.value(s,namespace["note"],any=True)) == "nostandalone": nostandalone = True for alt in store.objects(s, namespace["altLabel"]): candidates.append(alt.strip()) for hid in store.objects(s, namespace["hiddenLabel"]): candidates.append(hid.strip()) # We then create a regex pattern for each candidate and we match it in the document # First we match any possible candidate containing regex. These have to be handled a priori # (because they might cause double matching, e.g. 
"gauge theor*" will match "gauge theory" for candidate in candidates: if candidate.find("/", 0, 1) > -1: # We have a wildcard or other regex, do not escape chars # Wildcards matched with '\w*'. These truncations should go into hidden labels in the ontology regex = True pattern = makePattern(candidate, 3) wildcard = pattern hideOUT += len(re.findall(pattern,text_string)) # print "HIDEOUT: " + str(candidate) + " " + str(hideOUT) for candidate in candidates: # Different patterns are created according to the type of candidate keyword encountered if candidate.find("/", 0, 1) > -1: # We have already taken care of this continue elif regex and candidate.find("/", 0, 1) == -1 and len(re.findall(wildcard," " + candidate + " ")) > 0: # The wildcard in hiddenLabel matches this candidate: skip it # print "\ncase 2 touched\n" continue elif candidate.find("-") > -1: # We have an hyphen -> e.g. "word-word". Look for: "word-word", "wordword", "word word" (case insensitive) pattern = makePattern(candidate, 2) elif candidate[:2].isupper() or len(candidate) < 3: # First two letters are uppercase or very short keyword. This could be an acronym. Better leave case untouched pattern = makePattern(candidate, 1) else: # Let's do some plain case insensitive search pattern = makePattern(candidate, 0) if len(candidate) < 3: # We have a short keyword if len(re.findall(pattern,abstract))> 0: # The short keyword appears in the abstract/title, retain it dictOUT += len(re.findall(pattern,text_string)) safeOUT += len(re.findall(pattern,safe_keys)) else: dictOUT += len(re.findall(pattern,text_string)) safeOUT += len(re.findall(pattern,safe_keys)) dictOUT += hideOUT if dictOUT > 0 and store.value(s,namespace["compositeOf"],default=False,any=True): # This is a ckw whose altLabel occurs in the text ckwlist[s.strip()] = dictOUT elif dictOUT > 0: keylist.append([dictOUT, s.strip(), pref.strip(), safeOUT, candidates, nostandalone]) regex = False keylist.sort() keylist.reverse() compiled_ontology_db['keylist'] = keylist compiled_ontology_db.close() else: # we can reuse compiled ontology: keylist = compiled_ontology_db['keylist'] compiled_ontology_db.close() if limit > len(keylist): limit = len(keylist) if nkeywords > limit: nkeywords = limit # Sort out composite keywords based on limit (default=70) # Work out whether among l single keywords, there are possible composite combinations # Generate compositesIDX dictionary of the form: s (URI) : keylist for i in range(limit): try: if store.value(rdflib.Namespace(keylist[i][1]),namespace["composite"],default=False,any=True): compositesIDX[keylist[i][1]] = keylist[i] for composite in store.objects(rdflib.Namespace(keylist[i][1]),namespace["composite"]): if composites.has_key(composite): composites[composite].append(keylist[i][1]) else: composites[composite]=[keylist[i][1]] elif store.value(rdflib.Namespace(keylist[i][1]),namespace["compositeOf"],default=False,any=True): compositesIDX[keylist[i][1]] = keylist[i] else: outlist.append(keylist[i]) except: write_message("Problem with composites.. : %s" % keylist[i][1]) for s_CompositeOf in composites: if len(composites.get(s_CompositeOf)) > 2: write_message("%s - Sorry! Only composite combinations of max 2 keywords are supported at the moment." % s_CompositeOf) elif len(composites.get(s_CompositeOf)) > 1: # We have a composite match. 
Need to look for composite1 near composite2 comp_one = compositesIDX[composites.get(s_CompositeOf)[0]][2] comp_two = compositesIDX[composites.get(s_CompositeOf)[1]][2] # Now check that comp_one and comp_two really correspond to ckw1 : ckw2 if store.value(rdflib.Namespace(s_CompositeOf),namespace["prefLabel"],default=False,any=True).split(":")[0].strip() == comp_one: # order is correct searchables_one = compositesIDX[composites.get(s_CompositeOf)[0]][4] searchables_two = compositesIDX[composites.get(s_CompositeOf)[1]][4] comp_oneOUT = compositesIDX[composites.get(s_CompositeOf)[0]][0] comp_twoOUT = compositesIDX[composites.get(s_CompositeOf)[1]][0] else: # reverse order comp_one = compositesIDX[composites.get(s_CompositeOf)[1]][2] comp_two = compositesIDX[composites.get(s_CompositeOf)[0]][2] searchables_one = compositesIDX[composites.get(s_CompositeOf)[1]][4] searchables_two = compositesIDX[composites.get(s_CompositeOf)[0]][4] comp_oneOUT = compositesIDX[composites.get(s_CompositeOf)[1]][0] comp_twoOUT = compositesIDX[composites.get(s_CompositeOf)[0]][0] compOUT = 0 wildcards = [] phrases = [] for searchable_one in searchables_one: # Work out all possible combination of comp1 near comp2 c1 = searchable_one if searchable_one.find("/", 0, 1) > -1: m1 = 3 elif searchable_one.find("-") > -1: m1 = 2 elif searchable_one[:2].isupper() or len(searchable_one) < 3: m1 = 1 else: m1 = 0 for searchable_two in searchables_two: c2 = searchable_two if searchable_two.find("/", 0, 1) > -1: m2 = 3 elif searchable_two.find("-") > -1: m2 = 2 elif searchable_two[:2].isupper() or len(searchable_two) < 3: m2 = 1 else: m2 = 0 c = [c1,c2] m = [m1,m2] patterns = makeCompPattern(c, m) if m1 == 3 or m2 == 3: # One of the composites had a wildcard inside wildcards.append(patterns[0]) wildcards.append(patterns[1]) else: # No wildcards phrase1 = c1 + " " + c2 phrase2 = c2 + " " + c1 phrases.append([phrase1, patterns[0]]) phrases.append([phrase2, patterns[1]]) THIScomp = len(re.findall(patterns[0],text_string)) + len(re.findall(patterns[1],text_string)) compOUT += THIScomp if len(wildcards)>0: for wild in wildcards: for phrase in phrases: if len(re.findall(wild," " + phrase[0] + " ")) > 0: compOUT = compOUT - len(re.findall(phrase[1],text_string)) # Add extra results due to altLabels, calculated in the first part if ckwlist.get(s_CompositeOf, 0) > 0: # Add count and pop the item out of the dictionary compOUT += ckwlist.pop(s_CompositeOf) if compOUT > 0 and spires: # Output ckws in spires standard output mode (,) if store.value(rdflib.Namespace(s_CompositeOf),namespace["spiresLabel"],default=False,any=True): compositesOUT.append([compOUT, store.value(rdflib.Namespace(s_CompositeOf),namespace["spiresLabel"],default=False,any=True), comp_one, comp_two, comp_oneOUT, comp_twoOUT]) else: compositesOUT.append([compOUT, store.value(rdflib.Namespace(s_CompositeOf),namespace["prefLabel"],default=False,any=True).replace(":",","), comp_one, comp_two, comp_oneOUT, comp_twoOUT]) keys2drop.append(comp_one.strip()) keys2drop.append(comp_two.strip()) elif compOUT > 0: # Output ckws in bibclassify mode (:) compositesOUT.append([compOUT, store.value(rdflib.Namespace(s_CompositeOf),namespace["prefLabel"],default=False,any=True), comp_one, comp_two, comp_oneOUT, comp_twoOUT]) keys2drop.append(comp_one.strip()) keys2drop.append(comp_two.strip()) # Deal with ckws that only occur as altLabels ckwleft = len(ckwlist) while ckwleft > 0: compositesTOADD.append(ckwlist.popitem()) ckwleft = ckwleft - 1 for s_CompositeTOADD, compTOADD_OUT in 
compositesTOADD: if spires: compositesOUT.append([compTOADD_OUT, store.value(rdflib.Namespace(s_CompositeTOADD),namespace["prefLabel"],default=False,any=True).replace(":",","), "null", "null", 0, 0]) else: compositesOUT.append([compTOADD_OUT, store.value(rdflib.Namespace(s_CompositeTOADD),namespace["prefLabel"],default=False,any=True), "null", "null", 0, 0]) compositesOUT.sort() compositesOUT.reverse() # Some more keylist filtering: inclusion, e.g subtract "magnetic" if have "magnetic field" for i in keylist: pattern_to_match = " " + i[2].strip() + " " for j in keylist: test_key = " " + j[2].strip() + " " if test_key.strip() != pattern_to_match.strip() and test_key.find(pattern_to_match) > -1: keys2drop.append(pattern_to_match.strip()) text_out += "\nComposite keywords:\n" for ncomp, pref_cOf_label, comp_one, comp_two, comp_oneOUT, comp_twoOUT in compositesOUT: safe_comp_mark = " " safe_one_mark = "" safe_two_mark = "" if safe_keys.find(pref_cOf_label)>-1: safe_comp_mark = "*" if safe_keys.find(comp_one)>-1: safe_one_mark = "*" if safe_keys.find(comp_two)>-1: safe_two_mark = "*" raw.append([str(ncomp),str(pref_cOf_label)]) text_out += str(ncomp) + safe_comp_mark + " " + str(pref_cOf_label) + " [" + str(comp_oneOUT) + safe_one_mark + ", " + str(comp_twoOUT) + safe_two_mark + "]\n" if safe_comp_mark == "*": html_out.append([ncomp, str(pref_cOf_label), 1]) else: html_out.append([ncomp, str(pref_cOf_label), 0]) text_out += "\n\nSingle keywords:\n" for i in range(limit): safe_mark = " " try: idx = keys2drop.index(keylist[i][2].strip()) except: idx = -1 if safe_keys.find(keylist[i][2])>-1: safe_mark = "*" if idx == -1 and nkeywords > 0 and not keylist[i][5]: text_out += str(keylist[i][0]) + safe_mark + " " + keylist[i][2] + "\n" raw.append([keylist[i][0], keylist[i][2]]) if safe_mark == "*": html_out.append([keylist[i][0], keylist[i][2], 1]) else: html_out.append([keylist[i][0], keylist[i][2], 0]) nkeywords = nkeywords - 1 if output == 0: # Output some text return text_out elif output == 2: # return marc xml output. xml = "" for key in raw: xml += """ %s BibClassify/%s """ % (key[1],os.path.splitext(os.path.basename(ontology))[0]) return xml else: # Output some HTML html_out.sort() html_out.reverse() return make_tag_cloud(html_out) def make_tag_cloud(entries): """Using the counts for each of the tags, write a simple HTML page to standard output containing a tag cloud representation. The CSS describes ten levels, each of which has differing font-size's, line-height's and font-weight's. """ max_occurrence = int(entries[0][0]) ret = "\n" ret += "\n" ret += "Keyword Cloud\n" ret += "\n" ret += "\n" ret += "\n" ret += "\n" cloud = "" cloud_list = [] cloud += '
' # Generate some ad-hoc count distribution for i in range(0, len(entries)): count = int(entries[i][0]) tag = str(entries[i][1]) color = int(entries[i][2]) if count < (max_occurrence/10): cloud_list.append([tag,0,color]) elif count < (max_occurrence/7.5): cloud_list.append([tag,1,color]) elif count < (max_occurrence/5): cloud_list.append([tag,2,color]) elif count < (max_occurrence/4): cloud_list.append([tag,3,color]) elif count < (max_occurrence/3): cloud_list.append([tag,4,color]) elif count < (max_occurrence/2): cloud_list.append([tag,5,color]) elif count < (max_occurrence/1.7): cloud_list.append([tag,6,color]) elif count < (max_occurrence/1.5): cloud_list.append([tag,7,color]) elif count < (max_occurrence/1.3): cloud_list.append([tag,8,color]) else: cloud_list.append([tag,9,color]) cloud_list.sort() for i in range(0, len(cloud_list)): cloud += ' 0: cloud += 'style="color:red" ' cloud += '> %s ' % cloud_list[i][0] cloud += '
' ret += cloud + '\n' ret += "
\n" ret += "\n" return ret def makeCompPattern(candidates, modes): """Takes a set of two composite keywords (candidates) and compiles a REGEX expression around it, according to the chosen modes for each one: - 0 : plain case-insensitive search - 1 : plain case-sensitive search - 2 : hyphen - 3 : wildcard""" begREGEX = '(?:[^A-Za-z0-9\+-])(' endREGEX = ')(?=[^A-Za-z0-9\+-])' pattern_text = [] patterns = [] for i in range(2): if modes[i] == 0: pattern_text.append(str(re.escape(candidates[i]) + 's?')) if modes[i] == 1: pattern_text.append(str(re.escape(candidates[i]))) if modes[i] == 2: hyphen = True parts = candidates[i].split("-") pattern_string = "" for part in parts: if len(part)<1 or part.find(" ", 0, 1)> -1: # This is not really a hyphen, maybe a minus sign: treat as isupper(). hyphen = False pattern_string = pattern_string + re.escape(part) + "[- \t]?" if hyphen: pattern_text.append(pattern_string) else: pattern_text.append(re.escape(candidates[i])) if modes[i] == 3: pattern_text.append(candidates[i].replace("/","")) pattern_one = re.compile(begREGEX + pattern_text[0] + "s?[ \s,-]*" + pattern_text[1] + endREGEX, re.I) pattern_two = re.compile(begREGEX + pattern_text[1] + "s?[ \s,-]*" + pattern_text[0] + endREGEX, re.I) patterns.append(pattern_one) patterns.append(pattern_two) return patterns def makePattern(candidate, mode): """Takes a keyword (candidate) and compiles a REGEX expression around it, according to the chosen mode: - 0 : plain case-insensitive search - 1 : plain case-sensitive search - 2 : hyphen - 3 : wildcard""" # NB. At the moment, some patterns are compiled having an optional trailing "s". # This is a very basic method to find plurals in English. # If this program is to be used in other languages, please remove the "s?" from the REGEX # Also, inclusion of plurals at the ontology level would be preferred. begREGEX = '(?:[^A-Za-z0-9\+-])(' endREGEX = ')(?=[^A-Za-z0-9\+-])' try: if mode == 0: pattern = re.compile(begREGEX + re.escape(candidate) + 's?' + endREGEX, re.I) if mode == 1: pattern = re.compile(begREGEX + re.escape(candidate) + endREGEX) if mode == 2: hyphen = True parts = candidate.split("-") pattern_string = begREGEX for part in parts: if len(part)<1 or part.find(" ", 0, 1)> -1: # This is not really a hyphen, maybe a minus sign: treat as isupper(). hyphen = False pattern_string = pattern_string + re.escape(part) + "[- \t]?" pattern_string += endREGEX if hyphen: pattern = re.compile(pattern_string, re.I) else: pattern = re.compile(begREGEX + re.escape(candidate) + endREGEX, re.I) if mode == 3: pattern = re.compile(begREGEX + candidate.replace("/","") + endREGEX, re.I) except: print "Invalid thesaurus term: " + re.escape(candidate) + "
" return pattern def profile(t="", d=""): import profile import pstats profile.run("generate_keywords_rdf(textfile='%s',dictfile='%s')" % (t, d), "bibclassify_profile") p = pstats.Stats("bibclassify_profile") p.strip_dirs().sort_stats("cumulative").print_stats() return 0 def main(): """Main function """ global options long_flags =["file=", "thesaurus=","ontology=", "output=","limit=", "nkeywords=", "mode=", "spires", "help", "version"] short_flags ="f:k:K:o:l:n:m:qhVv:" spires = False limit = 70 nkeywords = 25 input_file = "" dict_file = "" output = 0 mode = 0 verbose = 0 try: opts, args = getopt.getopt(sys.argv[1:], short_flags, long_flags) except getopt.GetoptError, err: write_message(err, sys.stderr) usage(1) if args: usage(1) try: - from invenio.config import tmpdir, CFG_PATH_PDFTOTEXT, version + from invenio.config import CFG_TMPDIR, CFG_PATH_PDFTOTEXT, CFG_VERSION version_bibclassify = 0.1 - bibclassify_engine_version = "CDS Invenio/%s bibclassify/%s" % (version, version_bibclassify) + bibclassify_engine_version = "CDS Invenio/%s bibclassify/%s" % (CFG_VERSION, version_bibclassify) except: - tmpdir = TMPDIR_STANDALONE + CFG_TMPDIR = TMPDIR_STANDALONE CFG_PATH_PDFTOTEXT = PDFTOTEXT_STANDALONE - temp_text = tmpdir + '/bibclassify.pdftotext.' + str(os.getpid()) + temp_text = CFG_TMPDIR + '/bibclassify.pdftotext.' + str(os.getpid()) try: for opt in opts: if opt == ("-h","") or opt == ("--help",""): usage(1) elif opt == ("-V","") or opt == ("--version",""): print bibclassify_engine_version sys.exit(1) elif opt[0] in [ "-v", "--verbose" ]: verbose = opt[1] elif opt[0] in [ "-f", "--file" ]: if opt[1].find(".pdf")>-1: # Treat as PDF cmd = "%s " % CFG_PATH_PDFTOTEXT + opt[1] + " " + temp_text errcode = os.system(cmd) if errcode == 0 and os.path.exists("%s" % temp_text): input_file = temp_text else: print "Error while running %s.\n" % cmd sys.exit(1) else: # Treat as text input_file = opt[1] elif opt[0] in [ "-k", "--thesaurus" ]: if dict_file=="": dict_file = opt[1] else: print "Either a text thesaurus or an ontology (in .rdf format)" sys.exit(1) elif opt[0] in [ "-K", "--taxonomy" ]: if dict_file=="" and opt[1].find(".rdf")!=-1: dict_file = opt[1] else: print "Either a text thesaurus or an ontology (in .rdf format)" sys.exit(1) elif opt[0] in [ "-o", "--output" ]: try: if str(opt[1]).lower().strip() == "html": output = 1 elif str(opt[1]).lower().strip() == "text": output = 0 elif str(opt[1]).lower().strip() == "marcxml": output = 2 else: write_message('Output mode (-o) can only be "HTML", "TEXT", or "MARCXML". Using default output mode (HTML)') except: write_message('Output mode (-o) can only be "HTML", "TEXT", or "MARCXML". Using default output mode (HTML)') elif opt[0] in [ "-m", "--mode" ]: try: if str(opt[1]).lower().strip() == "partial": mode = 1 elif str(opt[1]).lower().strip() == "full": mode = 0 else: write_message('Processing mode (-m) can only be "PARTIAL" or "FULL". Using default output mode (FULL)') except: write_message('Processing mode (-m) can only be "PARTIAL" or "FULL". Using default output mode (FULL)') elif opt[0] in [ "-q", "--spires" ]: spires = True elif opt[0] in [ "-l", "--limit" ]: try: num = int(opt[1]) if num>1: limit = num else: write_message("Number of keywords for processing (--limit) must be an integer higher than 1. Using default value of 70...") except ValueError: write_message("Number of keywords for processing (-n) must be an integer. 
Using default value of 70...") elif opt[0] in [ "-n", "--nkeywords" ]: try: num = int(opt[1]) if num>1: nkeywords = num else: write_message("Number of keywords (--nkeywords) must be an integer higher than 1. Using default value of 25...") except ValueError: write_message("Number of keywords (--n) must be an integer. Using default value of 25...") except StandardError, e: write_message(e, sys.stderr) sys.exit(1) if input_file == "" or dict_file == "": write_message("Need to enter the name of an input file AND a thesaurus file \n") usage(1) # Weak method to detect dict_file. Need to improve this (e.g. by looking inside the metadata with rdflib?) if dict_file.find(".rdf")!=-1: outcome = generate_keywords_rdf(input_file, dict_file, output, limit, nkeywords, mode, spires, verbose, dict_file) else: # Treat as text outcome = generate_keywords(input_file, dict_file, verbose) print outcome if limit > len(outcome): limit = len(outcome) if output == 0: for i in range(limit): print outcome[i] else: print "" print "" print "Keywords" print "" print "" print '
' for i in range(limit): print "" + str(outcome[i]) + "
" print '
' print "
" print "" return if __name__ == '__main__': main() diff --git a/modules/bibconvert/lib/bibconvert.py b/modules/bibconvert/lib/bibconvert.py index 2da2a596c..bb1dd6700 100644 --- a/modules/bibconvert/lib/bibconvert.py +++ b/modules/bibconvert/lib/bibconvert.py @@ -1,2100 +1,2100 @@ ## $Id$ - + ## This file is part of CDS Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 CERN. ## ## CDS Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## CDS Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with CDS Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. """BibConvert tool to convert bibliographic records from any format to any format.""" __revision__ = "$Id$" import fileinput import string import os import re import sys import time import getopt from time import gmtime, strftime, localtime import os.path from invenio.config import \ CFG_OAI_ID_PREFIX, \ - version,\ - etcdir + CFG_VERSION,\ + CFG_ETCDIR from invenio.search_engine import perform_request_search -CFG_BIBCONVERT_KB_PATH = "%s%sbibconvert%sKB" % (etcdir, os.sep, os.sep) +CFG_BIBCONVERT_KB_PATH = "%s%sbibconvert%sKB" % (CFG_ETCDIR, os.sep, os.sep) ### Matching records with database content def parse_query_string(query_string): """Parse query string, e.g.: Input: 245__a::REP(-, )::SHAPE::SUP(SPACE, )::MINL(4)::MAXL(8)::EXPW(PUNCT)::WORDS(4,L)::SHAPE::SUP(SPACE, )||700__a::MINL(2)::REP(COMMA,). 
Output:[['245__a','REP(-,)','SHAPE','SUP(SPACE, )','MINL(4)','MAXL(8)','EXPW(PUNCT)','WORDS(4,L)','SHAPE','SUP(SPACE, )'],['700__a','MINL(2)','REP(COMMA,)']] """ query_string_out = [] query_string_out_in = [] query_string_split_1 = query_string.split('||') for item_1 in query_string_split_1: query_string_split_2 = item_1.split('::') query_string_out_in = [] for item in query_string_split_2: query_string_out_in.append(item) query_string_out.append(query_string_out_in) return query_string_out def set_conv(): """ bibconvert common settings ======================= minimal length of output line = 1 maximal length of output line = 4096 """ conv_setting = [ - 1, + 1, 4096 ] return conv_setting def get_pars(fn): "Read function and its parameters into list" - + out = [] out.append(re.split('\(|\)', fn)[0]) out.append(re.split(',', re.split('\(|\)', fn)[1])) return out def get_other_par(par, cfg): "Get other parameter (par) from the configuration file (cfg)" out = "" other_parameters = { '_QRYSTR_' : '_QRYSTR_---.*$', '_MATCH_' : '_MATCH_---.*$', '_RECSEP_' : '_RECSEP_---.*$', '_EXTCFG_' : '_EXTCFG_---.*$', '_SRCTPL_' : '_SRCTPL_---.*$', '_DSTTPL_' : '_DSTTPL_---.*$', '_RECHEAD_': '_RECHEAD_---.*$', '_RECFOOT_': '_RECFOOT_---.*$', '_HEAD_' : '_HEAD_---.*$', '_FOOT_' : '_FOOT_---.*$', '_EXT_' : '_EXT_---.*$', '_SEP_' : '_SEP_---.*$', '_COD_' : '_COD_---.*$', '_FRK_' : '_FRK_---.*$', '_NC_' : '_NC_---.*$', '_MCH_' : '_MCH_---.*$', '_UPL_' : '_UPL_---.*$', '_AUTO_' : '_AUTO_---.*$' - + } - + parameters = other_parameters.keys() for line in fileinput.input(cfg): pattern = re.compile(other_parameters[par]) items = pattern.findall(line) for item in items: out = item.split('---')[1] return out def append_to_output_file(filename, output): "bibconvert output file creation by output line" try: file = open(filename, 'a') file.write(output) file.close() except IOError, e: exit_on_error("Cannot write into %s" % filename) - + return 1 - + def sub_keywd(out): "bibconvert keywords literal substitution" out = string.replace(out, "EOL", "\n") out = string.replace(out, "_CR_", "\r") out = string.replace(out, "_LF_", "\n") out = string.replace(out, "\\", '\\') out = string.replace(out, "\r", '\r') out = string.replace(out, "BSLASH", '\\') - out = string.replace(out, "COMMA", ',') + out = string.replace(out, "COMMA", ',') out = string.replace(out, "LEFTB", '[') out = string.replace(out, "RIGHTB", ']') out = string.replace(out, "LEFTP", '(') out = string.replace(out, "RIGHTP", ')') - + return out def check_split_on(data_item_split, sep, tpl_f): """ bibconvert conditional split with following conditions =================================================== ::NEXT(N,TYPE,SIDE) - next N chars are of the TYPE having the separator on the SIDE ::PREV(N,TYPE,SIDE) - prev.N chars are of the TYPE having the separator on the SIDE - """ + """ fn = get_pars(tpl_f)[0] par = get_pars(tpl_f)[1] - - + + done = 0 while (done == 0): if ( (( fn == "NEXT" ) and ( par[2]=="R" )) or (( fn == "PREV" ) and ( par[2]=="L" )) ): test_value = data_item_split[0][-(string.atoi(par[0])):] - + elif ( ((fn == "NEXT") and ( par[2]=="L")) or ((fn == "PREV") and ( par[2]=="R")) ): - + test_value = data_item_split[1][:(string.atoi(par[0]))] data_item_split_tmp = [] if ((FormatField(test_value, "SUP(" + par[1] + ",)") != "") \ or (len(test_value) < string.atoi(par[0]))): data_item_split_tmp = data_item_split[1].split(sep, 1) if(len(data_item_split_tmp)==1): done = 1 data_item_split[0] = data_item_split[0] + sep + \ data_item_split_tmp[0] data_item_split[1] = 
"" else: data_item_split[0] = data_item_split[0] + sep + \ data_item_split_tmp[0] data_item_split[1] = data_item_split_tmp[1] else: done = 1 return data_item_split def get_subfields(data, subfield, src_tpl): "Get subfield according to the template" out = [] for data_item in data: found = 0 for src_tpl_item in src_tpl: if (src_tpl_item[:2] == "<:"): if (src_tpl_item[2:-2] == subfield): found = 1 else: sep_in_list = src_tpl_item.split("::") sep = sep_in_list[0] - + data_item_split = data_item.split(sep, 1) if (len(data_item_split)==1): data_item = data_item_split[0] else: if (len(sep_in_list) > 1): data_item_split = check_split_on(data_item.split(sep, 1), sep_in_list[0], sep_in_list[1]) if(found == 1): data_item = data_item_split[0] else: data_item = string.join(data_item_split[1:], sep) out.append(data_item) return out def exp_n(word): "Replace newlines and carriage return's from string." out = "" - + for ch in word: if ((ch != '\n') and (ch != '\r')): out = out + ch - return out - + return out + def exp_e(list): "Expunge empty elements from a list" out = [] for item in list: item = exp_n(item) if ((item != '\r\n' and item != '\r' \ and item != '\n' and item !="" \ and len(item)!=0)): out.append(item) return out def sup_e(word): "Replace spaces" out = "" - + for ch in word: if (ch != ' '): out = out + ch - return out + return out def select_line(field_code, list): "Return appropriate item from a list" - + out = [''] for field in list: - + field[0] = sup_e(field[0]) field_code = sup_e(field_code) if (field[0] == field_code): out = field[1] return out def parse_field_definition(source_field_definition): "Create list of source_field_definition" - + word_list = [] out = [] word = "" counter = 0 - + if (len(source_field_definition.split("---"))==4): out = source_field_definition.split("---") else: element_list_high = source_field_definition.split("<:") for word_high in element_list_high: element_list_low = word_high.split(':>') for word_low in element_list_low: word_list.append(word_low) word_list.append(":>") - word_list.pop() + word_list.pop() word_list.append("<:") word_list.pop() for item in word_list: word = word + item if (item == "<:"): counter = counter + 1 if (item == ":>"): counter = counter - 1 if counter == 0: out.append(word) word = "" return out def parse_template(template): """ bibconvert parse template ====================== - in - template filename + in - template filename out - [ [ field_code , [ field_template_parsed ] , [] ] """ out = [] for field_def in read_file(template, 1): field_tpl_new = [] if ((len(field_def.split("---", 1)) > 1) and (field_def[:1] != "#")): - + field_code = field_def.split("---", 1)[0] field_tpl = parse_field_definition(field_def.split("---", 1)[1]) - + field_tpl_new = field_tpl field_tpl = exp_e(field_tpl_new) out_data = [field_code, field_tpl] out.append(out_data) - + return out def parse_common_template(template, part): """ bibconvert parse template ========================= in - template filename out - [ [ field_code , [ field_template_parsed ] , [] ] """ out = [] counter = 0 for field_def in read_file(template, 1): if (exp_n(field_def)[:3] == "==="): counter = counter + 1 - + elif (counter == part): - + field_tpl_new = [] if ((len(field_def.split("---", 1)) > 1) and (field_def[:1]!="#")): - + field_code = field_def.split("---", 1)[0] field_tpl = parse_field_definition(field_def.split("---", 1)[1]) - + field_tpl_new = field_tpl field_tpl = exp_e(field_tpl_new) out_data = [field_code, field_tpl] out.append(out_data) return out def 
parse_input_data_f(source_data_open, source_tpl): """ bibconvert parse input data ======================== in - input source data location (filehandle) source data template source_field_code list of source field codes source_field_data list of source field data values (repetitive fields each line one occurence) out - [ [ source_field_code , [ source_field_data ] ] , [] ] source_data_template entry - field_code---[const]<:subfield_code:>[const][<:subfield_code:>][] destination_templace entry - [::GFF()]---[const]<:field_code::subfield_code[::FF()]:>[] input data file; by line: - fieldcode value """ global separator out = [['', []]] count = 0 values = [] while (count < 1): line = source_data_open.readline() if (line == ""): return(-1) line_split = line.split(" ", 1) if (re.sub("\s", "", line) == separator): count = count + 1 if (len(line_split) == 2): field_code = line_split[0] field_value = exp_n(line_split[1]) - + values.append([field_code, field_value]) item_prev = "" stack = [''] - + for item in values: if ((item[0]==item_prev)or(item_prev == "")): stack.append(item[1]) item_prev = item[0] else: out.append([item_prev, stack]) item_prev = item[0] stack = [] stack.append(item[1]) try: if (stack[0] != ""): if (out[0][0]==""): out = [] out.append([field_code, stack]) except IndexError, e: out = out - + return out def parse_input_data_fx(source_tpl): """ bibconvert parse input data ======================== in - input source data location (filehandle) source data template source_field_code list of source field codes source_field_data list of source field data values (repetitive fields each line one occurence) out - [ [ source_field_code , [ source_field_data ] ] , [] ] - extraction_template_entry - + extraction_template_entry - input data file - specified by extract_tpl """ global separator count = 0 record = "" field_data_1_in_list = [] out = [['', []]] while (count <10): line = sys.stdin.readline() if (line == ""): count = count + 1 if (record == "" and count): return (-1) if (re.sub("\s", "", line) == separator): count = count + 10 else: record = record + line for field_defined in extract_tpl_parsed: try: field_defined[1][0] = sub_keywd(field_defined[1][0]) field_defined[1][1] = sub_keywd(field_defined[1][1]) except IndexError, e: field_defined = field_defined - + try: field_defined[1][2] = sub_keywd(field_defined[1][2]) except IndexError, e: field_defined = field_defined - + field_data_1 ="" - + if ((field_defined[1][0][0:2] == '//') and \ (field_defined[1][0][-2:] == '//')): field_defined_regexp = field_defined[1][0][2:-2] try: #### if (len(re.split(field_defined_regexp, record)) == 1): field_data_1 = "" field_data_1_in_list = [] else: field_data_1_tmp = re.split(field_defined_regexp, record, 1)[1] field_data_1_in_list = field_data_1_tmp.split(field_defined_regexp) - + except IndexError, e: field_data_1 = "" else: try: if (len(record.split(field_defined[1][0])) == 1): field_data_1 = "" field_data_1_in_list = [] else: field_data_1_tmp = record.split(field_defined[1][0], 1)[1] field_data_1_in_list = field_data_1_tmp.split(field_defined[1][0]) except IndexError, e: field_data_1 = "" - + spliton = [] outvalue = "" field_data_2 = "" field_data = "" - + try: if ((field_defined[1][1])=="EOL"): spliton = ['\n'] elif ((field_defined[1][1])=="MIN"): spliton = ['\n'] elif ((field_defined[1][1])=="MAX"): for item in extract_tpl_parsed: try: spliton.append(item[1][0]) except IndexError, e: spliton = spliton elif (field_defined[1][1][0:2] == '//') and \ (field_defined[1][1][-2:] == '//'): spliton = 
[field_defined[1][1][2:-2]] - + else: spliton = [field_defined[1][1]] - + except IndexError,e : spliton = "" outvalues = [] - + for field_data in field_data_1_in_list: outvalue = "" for splitstring in spliton: - + field_data_2 = "" if (len(field_data.split(splitstring))==1): if (outvalue == ""): field_data_2 = field_data else: field_data_2 = outvalue else: field_data_2 = field_data.split(splitstring)[0] - + outvalue = field_data_2 field_data = field_data_2 - + outvalues.append(outvalue) outvalues = exp_e(outvalues) if (len(outvalues) > 0): if (out[0][0]==""): out = [] outstack = [] if (len(field_defined[1])==3): - + spliton = [field_defined[1][2]] if (field_defined[1][2][0:2] == '//') and \ (field_defined[1][2][-2:] == '//'): spliton = [field_defined[1][2][2:-2]] for item in outvalues: stack = re.split(spliton[0], item) for stackitem in stack: - outstack.append(stackitem) + outstack.append(stackitem) else: outstack = outvalues - + out.append([field_defined[0], outstack]) return out def parse_input_data_d(source_data, source_tpl): """ bibconvert parse input data ======================== in - input source data location (directory) source data template source_field_code list of source field codes source_field_data list of source field data values (repetitive fields each line one occurence) out - [ [ source_field_code , [ source_field_data ] ] , [] ] source_data_template entry - field_code---[const]<:subfield_code:>[const][<:subfield_code:>][] destination_templace entry - [::GFF()]---[const]<:field_code::subfield_code[::FF()]:>[] input data dir; by file: - fieldcode value per line """ - + out = [] - + for source_field_tpl in read_file(source_tpl, 1): source_field_code = source_field_tpl.split("---")[0] source_field_data = read_file(source_data + source_field_code, 0) source_field_data = exp_e(source_field_data) - + out_data = [source_field_code, source_field_data] out.append(out_data) - + return out def sub_empty_lines(value): out = re.sub('\n\n+', '', value) return out def set_par_defaults(par1, par2): "Set default parameter when not defined" par_new_in_list = par2.split(",") i = 0 out = [] for par in par_new_in_list: - + if (len(par1)>i): if (par1[i] == ""): out.append(par) else: out.append(par1[i]) else: out.append(par) i = i + 1 return out def generate(keyword): """ bibconvert generaded values: ========================= SYSNO() - generate date as '%w%H%M%S' WEEK(N) - generate date as '%V' with shift (N) DATE(format) - generate date in specifieddate FORMAT VALUE(value) - enter value literarly OAI() - generate oai_identifier, starting value given at command line as -o """ out = keyword fn = keyword + "()" par = get_pars(fn)[1] fn = get_pars(fn)[0] - + par = set_par_defaults(par, "") - + if (fn == "SYSNO"): out = sysno500 if (fn == "SYSNO330"): out = sysno if (fn == "WEEK"): par = set_par_defaults(par, "0") out = "%02d" % (string.atoi(strftime("%V", localtime())) \ + string.atoi(par[0])) if (string.atoi(out)<0): out = "00" if (fn == "VALUE"): par = set_par_defaults(par, "") out = par[0] if (fn == "DATE"): par = set_par_defaults(par, "%w%H%M%S," + "%d" % set_conv()[1]) out = strftime(par[0], localtime()) out = out[:string.atoi(par[1])] if (fn == "XDATE"): par = set_par_defaults(par,"%w%H%M%S," + ",%d" % set_conv()[1]) out = strftime(par[0], localtime()) out = par[1] + out[:string.atoi(par[2])] if (fn == "OAI"): out = "%s:%d" % (CFG_OAI_ID_PREFIX, tcounter + oai_identifier_from) return out def read_file(filename, exception): "Read file into list" out = [] if (os.path.isfile(filename)): file = 
open(filename,'r') out = file.readlines() file.close() else: if exception: exit_on_error("Cannot access file: %s" % filename) return out - + def crawl_KB(filename, value, mode): """ bibconvert look-up value in KB_file in one of following modes: =========================================================== 1 - case sensitive / match (default) 2 - not case sensitive / search 3 - case sensitive / search 4 - not case sensitive / match 5 - case sensitive / search (in KB) 6 - not case sensitive / search (in KB) 7 - case sensitive / search (reciprocal) 8 - not case sensitive / search (reciprocal) 9 - replace by _DEFAULT_ only R - not case sensitive / search (reciprocal) (8) replace """ if (os.path.isfile(filename) != 1): # Look for KB in same folder as extract_tpl, if exists try: pathtmp = string.split(extract_tpl,"/") pathtmp.pop() path = string.join(pathtmp,"/") filename = path + "/" + filename except NameError: # File was not found. Try to look inside default KB # directory filename = CFG_BIBCONVERT_KB_PATH + os.sep + filename - + # FIXME: Remove \n from returned value? if (os.path.isfile(filename)): - + file_to_read = open(filename,"r") - + file_read = file_to_read.readlines() for line in file_read: code = string.split(line, "---") - + if (mode == "2"): value_to_cmp = string.lower(value) code[0] = string.lower(code[0]) if ((len(string.split(value_to_cmp, code[0])) > 1) \ or (code[0]=="_DEFAULT_")): value = code[1] return value - + elif ((mode == "3") or (mode == "0")): if ((len(string.split(value, code[0])) > 1) or \ (code[0] == "_DEFAULT_")): value = code[1] return value elif (mode == "4"): value_to_cmp = string.lower(value) code[0] = string.lower(code[0]) if ((code[0] == value_to_cmp) or \ (code[0] == "_DEFAULT_")): value = code[1] return value elif (mode == "5"): if ((len(string.split(code[0], value)) > 1) or \ (code[0] == "_DEFAULT_")): value = code[1] return value - + elif (mode == "6"): value_to_cmp = string.lower(value) code[0] = string.lower(code[0]) if ((len(string.split(code[0], value_to_cmp)) > 1) or \ (code[0] == "_DEFAULT_")): value = code[1] return value - + elif (mode == "7"): if ((len(string.split(code[0], value)) > 1) or \ (len(string.split(value,code[0])) > 1) or \ (code[0] == "_DEFAULT_")): value = code[1] return value - + elif (mode == "8"): value_to_cmp = string.lower(value) code[0] = string.lower(code[0]) if ((len(string.split(code[0], value_to_cmp)) > 1) or \ (len(string.split(value_to_cmp, code[0])) > 1) or \ (code[0] == "_DEFAULT_")): value = code[1] return value - + elif (mode == "9"): if (code[0]=="_DEFAULT_"): value = code[1] return value elif (mode == "R"): value_to_cmp = string.lower(value) code[0] = string.lower(code[0]) if ((len(string.split(code[0], value_to_cmp)) > 1) or \ (len(string.split(value_to_cmp, code[0])) > 1) or \ (code[0] == "_DEFAULT_")): value = value.replace(code[0], code[1]) else: if ((code[0] == value) or (code[0]=="_DEFAULT_")): value = code[1] return value else: sys.stderr.write("Warning: given KB could not be found. 
\n") return value def FormatField(value, fn): """ bibconvert formatting functions: ================================ - ADD(prefix,suffix) - add prefix/suffix - KB(kb_file,mode) - lookup in kb_file and replace value + ADD(prefix,suffix) - add prefix/suffix + KB(kb_file,mode) - lookup in kb_file and replace value ABR(N,suffix) - abbreviate to N places with suffix ABRX() - abbreviate exclusively words longer ABRW() - abbreviate word (limit from right) REP(x,y) - replace SUP(type) - remove characters of certain TYPE LIM(n,side) - limit to n letters from L/R LIMW(string,side) - L/R after split on string WORDS(n,side) - limit to n words from L/R IF(value,valueT,valueF) - replace on IF condition MINL(n) - replace words shorter than n MINLW(n) - replace words shorter than n MAXL(n) - replace words longer than n EXPW(type) - replace word from value containing TYPE EXP(STR,0/1) - replace word from value containing string NUM() - take only digits in given string SHAPE() - remove extra space UP() - to uppercase DOWN() - to lowercase CAP() - make capitals each word SPLIT(n,h,str,from) - only for final Aleph field, i.e. AB , maintain whole words SPLITW(sep,h,str,from) - only for final Aleph field, split on string CONF(filed,value,0/1) - confirm validity of output line (check other field) CONFL(substr,0/1) - confirm validity of output line (check field being processed) CUT(prefix,postfix) - remove substring from side RANGE(MIN,MAX) - select items in repetitive fields RE(regexp) - regular expressions IFDEFP(field,value,0/1) - confirm validity of output line (check other field) NOTE: This function works for CONSTANT lines - those without any variable values in them. JOINMULTILINES(prefix,suffix) - Given a field-value with newlines in it, split the field on the new lines (\n), separating them with prefix, then suffix. E.g.: For the field XX with the value: Test Case, A And the function call: <:XX^::XX::JOINMULTILINES(,):> The results would be: TestCase, A One note on this: <:XX^::XX: Without the ^ the newlines will be lost as bibconvert will remove them, so you'll never see an effect from this function. 
- - + + bibconvert character TYPES ========================== ALPHA - alphabetic NALPHA - not alpphabetic NUM - numeric NNUM - not numeric ALNUM - alphanumeric NALNUM - non alphanumeric LOWER - lowercase UPPER - uppercase PUNCT - punctual NPUNCT - non punctual SPACE - space """ global data_parsed out = value fn = fn + "()" par = get_pars(fn)[1] fn = get_pars(fn)[0] regexp = "//" NRE = len(regexp) value = sub_keywd(value) par_tmp = [] for item in par: item = sub_keywd(item) par_tmp.append(item) - par = par_tmp - + par = par_tmp + if (fn == "RE"): new_value = "" par = set_par_defaults(par,".*,0") if (re.search(par[0], value) and (par[1] == "0")): new_value = value out = new_value - + if (fn == "KB"): new_value = "" - + par = set_par_defaults(par, "KB,0") new_value = crawl_KB(par[0], value, par[1]) out = new_value elif (fn == "ADD"): - + par = set_par_defaults(par, ",") out = par[0] + value + par[1] - + elif (fn == "ABR"): - par = set_par_defaults(par, "1,.") + par = set_par_defaults(par, "1,.") out = value[:string.atoi(par[0])] + par[1] elif (fn == "ABRW"): tmp = FormatField(value, "ABR(1,.)") tmp = tmp.upper() out = tmp elif (fn == "ABRX"): - par = set_par_defaults(par, ",") - toout = [] + par = set_par_defaults(par, ",") + toout = [] tmp = value.split(" ") for wrd in tmp: if (len(wrd) > string.atoi(par[0])): wrd = wrd[:string.atoi(par[0])] + par[1] toout.append(wrd) out = string.join(toout, " ") elif (fn == "SUP"): par = set_par_defaults(par, ",") if(par[0]=="NUM"): out = re.sub('\d+', par[1], value) - + if(par[0]=="NNUM"): out = re.sub('\D+', par[1], value) if(par[0]=="ALPHA"): out = re.sub('[a-zA-Z]+', par[1], value) if(par[0]=="NALPHA"): out = re.sub('[^a-zA-Z]+', par[1], value) if((par[0]=="ALNUM") or (par[0] == "NPUNCT")): out = re.sub('\w+', par[1], value) if(par[0]=="NALNUM"): out = re.sub('\W+', par[1], value) if(par[0]=="PUNCT"): out = re.sub('\W+', par[1], value) - + if(par[0]=="LOWER"): out = re.sub('[a-z]+', par[1], value) if(par[0]=="UPPER"): out = re.sub('[A-Z]+', par[1], value) if(par[0]=="SPACE"): out = re.sub('\s+', par[1], value) - + elif (fn == "LIM"): - par = set_par_defaults(par,",") + par = set_par_defaults(par,",") if (par[1] == "L"): - out = value[(len(value) - string.atoi(par[0])):] + out = value[(len(value) - string.atoi(par[0])):] if (par[1] == "R"): out = value[:string.atoi(par[0])] elif (fn == "LIMW"): - par = set_par_defaults(par,",") + par = set_par_defaults(par,",") if (par[0]!= ""): if (par[0][0:NRE] == regexp and par[0][-NRE:] == regexp): par[0] = par[0][NRE:-NRE] par[0] = re.search(par[0], value).group() tmp = value.split(par[0]) if (par[1] == "L"): out = par[0] + tmp[1] if (par[1] == "R"): out = tmp[0] + par[0] elif (fn == "WORDS"): tmp2 = [value] - par = set_par_defaults(par, ",") + par = set_par_defaults(par, ",") if (par[1] == "R"): tmp = value.split(" ") - tmp2 = [] + tmp2 = [] i = 0 while (i < string.atoi(par[0])): tmp2.append(tmp[i]) i = i + 1 if (par[1] == "L"): tmp = value.split(" ") tmp.reverse() tmp2 = [] i = 0 while (i < string.atoi(par[0])): tmp2.append(tmp[i]) i = i + 1 tmp2.reverse() out = string.join(tmp2, " ") elif (fn == "MINL"): - - par = set_par_defaults(par, "1") + + par = set_par_defaults(par, "1") tmp = value.split(" ") tmp2 = [] i = 0 for wrd in tmp: if (len(wrd) >= string.atoi(par[0])): tmp2.append(wrd) out = string.join(tmp2, " ") elif (fn == "MINLW"): - par = set_par_defaults(par, "1") + par = set_par_defaults(par, "1") if (len(value) >= string.atoi(par[0])): out = value else: out = "" elif (fn == "MAXL"): - par = 
set_par_defaults(par, "4096") + par = set_par_defaults(par, "4096") tmp = value.split(" ") tmp2 = [] i = 0 for wrd in tmp: if (len(wrd) <= string.atoi(par[0])): tmp2.append(wrd) out = string.join(tmp2, " ") - + elif (fn == "REP"): set_par_defaults(par, ",") if (par[0]!= ""): if (par[0][0:NRE] == regexp and par[0][-NRE:] == regexp): par[0] = par[0][NRE:-NRE] out = re.sub(par[0], value) else: out = value.replace(par[0], par[1]) elif (fn == "SHAPE"): - + if (value != ""): out = value.strip() elif (fn == "UP"): out = value.upper() elif (fn == "DOWN"): out = value.lower() elif (fn == "CAP"): tmp = value.split(" ") out2 = [] for wrd in tmp: wrd2 = wrd.capitalize() out2.append(wrd2) out = string.join(out2, " ") elif (fn == "IF"): par = set_par_defaults(par, ",,") N = 0 while N < 3: if (par[N][0:NRE] == regexp and par[N][-NRE:] == regexp): par[N] = par[N][NRE:-NRE] par[N] = re.search(par[N], value).group() N += 1 if (value == par[0]): out = par[1] else: out = par[2] if (out == "ORIG"): out = value elif (fn == "EXP"): par = set_par_defaults(par, ",0") if (par[0][0:NRE] == regexp and par[0][-NRE:] == regexp): par[0] = par[0][NRE:-NRE] par[0] = re.search(par[0], value).group() - + tmp = value.split(" ") out2 = [] for wrd in tmp: if (par[0][0:NRE] == regexp and par[0][-NRE:] == regexp): par[0] = par[0][NRE:-NRE] if ((re.search(par[0], wrd).group() == wrd) and \ (par[1] == "1")): out2.append(wrd) if ((re.search(par[0], wrd).group() != wrd) and \ (par[1] == "0")): out2.append(wrd) else: if ((len(wrd.split(par[0])) == 1) and \ (par[1] == "1")): out2.append(wrd) if ((len(wrd.split(par[0])) != 1) and \ (par[1] == "0")): - out2.append(wrd) + out2.append(wrd) out = string.join(out2," ") elif (fn == "EXPW"): par = set_par_defaults(par,",0") tmp = value.split(" ") out2 = [] for wrd in tmp: if ((FormatField(wrd,"SUP(" + par[0] + ")") == wrd) and \ (par[1] == "1")): out2.append(wrd) if ((FormatField(wrd,"SUP(" + par[0] + ")") != wrd) and \ (par[1] == "0")): out2.append(wrd) - + out = string.join(out2," ") - + elif fn == "JOINMULTILINES": ## Take a string, split it on newlines, and join them together, with ## a prefix and suffix for each segment. If prefix and suffix are ## empty strings, make suffix a single space. 
prefix = par[0] suffix = par[1] if prefix == "" and suffix == "": ## Values should at least be separated by something; ## make suffix a space: suffix = " " new_value = "" vals_list = value.split("\n") for item in vals_list: new_value += "%s%s%s" % (prefix, item, suffix) new_value.rstrip(" ") ## Update "out" with the newly created value: out = new_value elif (fn == "SPLIT"): par = set_par_defaults(par, "%d,0,,1" % conv_setting[1]) length = string.atoi(par[0]) + (string.atoi(par[1])) header = string.atoi(par[1]) headerplus = par[2] starting = string.atoi(par[3]) line = "" tmp2 = [] tmp3 = [] tmp = value.split(" ") linenumber = 1 if (linenumber >= starting): tmp2.append(headerplus) line = line + headerplus - + for wrd in tmp: line = line + " " + wrd tmp2.append(wrd) if (len(line) > length): linenumber = linenumber + 1 line = tmp2.pop() toout = string.join(tmp2) tmp3.append(toout) tmp2 = [] line2 = value[:header] if (linenumber >= starting): line3 = line2 + headerplus + line else: line3 = line2 + line - line = line3 - tmp2.append(line) + line = line3 + tmp2.append(line) tmp3.append(line) out = string.join(tmp3, "\n") out = FormatField(out, "SHAPE()") elif (fn == "SPLITW"): par = set_par_defaults(par, ",0,,1") if (par[0][0:NRE] == regexp and par[0][-NRE:] == regexp): par[0] = par[0][NRE:-NRE] str = re.search(par[0], value) header = string.atoi(par[1]) headerplus = par[2] starting = string.atoi(par[3]) counter = 1 - + tmp2 = [] tmp = re.split(par[0], value) last = tmp.pop() - + for wrd in tmp: counter = counter + 1 if (counter >= starting): tmp2.append(value[:header] + headerplus + wrd + str) else: tmp2.append(value[:header] + wrd + str) if (last != ""): counter = counter + 1 if (counter >= starting): tmp2.append(value[:header] + headerplus + last) else: tmp2.append(value[:header] + last) - + out = string.join(tmp2,"\n") elif (fn == "CONF"): par = set_par_defaults(par, ",,1") found = 0 par1 = "" data = select_line(par[0], data_parsed) - + for line in data: if (par[1][0:NRE] == regexp and par[1][-NRE:] == regexp): par1 = par[1][NRE:-NRE] else: par1 = par[1] if (par1 == ""): if (line == ""): found = 1 elif (len(re.split(par1,line)) > 1 ): found = 1 if ((found == 1) and (string.atoi(par[2]) == 1)): out = value if ((found == 1) and (string.atoi(par[2]) == 0)): out = "" if ((found == 0) and (string.atoi(par[2]) == 1)): out = "" if ((found == 0) and (string.atoi(par[2]) == 0)): out = value return out elif (fn == "IFDEFP"): par = set_par_defaults(par, ",,1") found = 0 par1 = "" data = select_line(par[0], data_parsed) if len(data) == 0 and par[1] == "": ## The "found" condition is that the field was empty found = 1 else: ## Seeking a value in the field - conduct the search: for line in data: if (par[1][0:NRE] == regexp and par[1][-NRE:] == regexp): par1 = par[1][NRE:-NRE] else: par1 = par[1] if (par1 == ""): if (line == ""): found = 1 elif (len(re.split(par1,line)) > 1 ): found = 1 if ((found == 1) and (string.atoi(par[2]) == 1)): out = value if ((found == 1) and (string.atoi(par[2]) == 0)): out = "" if ((found == 0) and (string.atoi(par[2]) == 1)): out = "" if ((found == 0) and (string.atoi(par[2]) == 0)): out = value return out elif (fn == "CONFL"): set_par_defaults(par,",1") if (par[0][0:NRE] == regexp and par[0][-NRE:] == regexp): par[0] = par[0][NRE:-NRE] if (re.search(par[0], value)): - if (string.atoi(par[1]) == 1): + if (string.atoi(par[1]) == 1): out = value else: out = "" else: - if (string.atoi(par[1]) == 1): + if (string.atoi(par[1]) == 1): out = "" else: out = value return out elif (fn == 
"CUT"): par = set_par_defaults(par, ",") left = value[:len(par[0])] right = value[-(len(par[1])):] if (left == par[0]): out = out[len(par[0]):] if (right == par[1]): out = out[:-(len(par[1]))] - + return out elif (fn == "NUM"): tmp = re.findall('\d', value) out = string.join(tmp, "") return out def format_field(value, fn): """ bibconvert formatting functions: ================================ - ADD(prefix,suffix) - add prefix/suffix - KB(kb_file,mode) - lookup in kb_file and replace value + ADD(prefix,suffix) - add prefix/suffix + KB(kb_file,mode) - lookup in kb_file and replace value ABR(N,suffix) - abbreviate to N places with suffix ABRX() - abbreviate exclusively words longer ABRW() - abbreviate word (limit from right) REP(x,y) - replace SUP(type) - remove characters of certain TYPE LIM(n,side) - limit to n letters from L/R LIMW(string,side) - L/R after split on string WORDS(n,side) - limit to n words from L/R IF(value,valueT,valueF) - replace on IF condition MINL(n) - replace words shorter than n MINLW(n) - replace words shorter than n MAXL(n) - replace words longer than n EXPW(type) - replace word from value containing TYPE EXP(STR,0/1) - replace word from value containing string NUM() - take only digits in given string SHAPE() - remove extra space UP() - to uppercase DOWN() - to lowercase CAP() - make capitals each word SPLIT(n,h,str,from) - only for final Aleph field, i.e. AB , maintain whole words SPLITW(sep,h,str,from) - only for final Aleph field, split on string CONF(filed,value,0/1) - confirm validity of output line (check other field) CONFL(substr,0/1) - confirm validity of output line (check field being processed) CUT(prefix,postfix) - remove substring from side RANGE(MIN,MAX) - select items in repetitive fields RE(regexp) - regular expressions - + bibconvert character TYPES ========================== ALPHA - alphabetic NALPHA - not alpphabetic NUM - numeric NNUM - not numeric ALNUM - alphanumeric NALNUM - non alphanumeric LOWER - lowercase UPPER - uppercase PUNCT - punctual NPUNCT - non punctual SPACE - space """ global data_parsed out = value fn = fn + "()" par = get_pars(fn)[1] fn = get_pars(fn)[0] regexp = "//" NRE = len(regexp) value = sub_keywd(value) par_tmp = [] for item in par: item = sub_keywd(item) par_tmp.append(item) - par = par_tmp - + par = par_tmp + if (fn == "RE"): new_value = "" par = set_par_defaults(par, ".*,0") if (re.search(par[0], value) and (par[1] == "0")): new_value = value out = new_value - + if (fn == "KB"): new_value = "" - + par = set_par_defaults(par, "KB,0") new_value = crawl_KB(par[0], value, par[1]) out = new_value elif (fn == "ADD"): - + par = set_par_defaults(par, ",") out = par[0] + value + par[1] - + elif (fn == "ABR"): - par = set_par_defaults(par, "1,.") + par = set_par_defaults(par, "1,.") out = value[:string.atoi(par[0])] + par[1] elif (fn == "ABRW"): tmp = format_field(value,"ABR(1,.)") tmp = tmp.upper() out = tmp elif (fn == "ABRX"): - par = set_par_defaults(par, ",") - toout = [] + par = set_par_defaults(par, ",") + toout = [] tmp = value.split(" ") for wrd in tmp: if (len(wrd) > string.atoi(par[0])): wrd = wrd[:string.atoi(par[0])] + par[1] toout.append(wrd) out = string.join(toout, " ") elif (fn == "SUP"): par = set_par_defaults(par, ",") if(par[0] == "NUM"): out = re.sub('\d+', par[1], value) - + if(par[0] == "NNUM"): out = re.sub('\D+', par[1], value) if(par[0] == "ALPHA"): out = re.sub('[a-zA-Z]+', par[1], value) if(par[0] == "NALPHA"): out = re.sub('[^a-zA-Z]+', par[1], value) if((par[0] == "ALNUM") or (par[0] == "NPUNCT")): 
out = re.sub('\w+', par[1], value) if(par[0] == "NALNUM"): out = re.sub('\W+', par[1], value) if(par[0] == "PUNCT"): out = re.sub('\W+', par[1], value) - + if(par[0] == "LOWER"): out = re.sub('[a-z]+', par[1], value) if(par[0] == "UPPER"): out = re.sub('[A-Z]+', par[1], value) if(par[0] == "SPACE"): out = re.sub('\s+', par[1], value) - + elif (fn == "LIM"): - par = set_par_defaults(par, ",") + par = set_par_defaults(par, ",") if (par[1] == "L"): - out = value[(len(value) - string.atoi(par[0])):] + out = value[(len(value) - string.atoi(par[0])):] if (par[1] == "R"): out = value[:string.atoi(par[0])] elif (fn == "LIMW"): - par = set_par_defaults(par, ",") + par = set_par_defaults(par, ",") if (par[0]!= ""): if (par[0][0:NRE] == regexp and par[0][-NRE:] == regexp): par[0] = par[0][NRE:-NRE] par[0] = re.search(par[0], value).group() tmp = value.split(par[0]) if (par[1] == "L"): out = par[0] + tmp[1] if (par[1] == "R"): out = tmp[0] + par[0] elif (fn == "WORDS"): tmp2 = [value] - par = set_par_defaults(par, ",") + par = set_par_defaults(par, ",") if (par[1] == "R"): tmp = value.split(" ") - tmp2 = [] + tmp2 = [] i = 0 while (i < string.atoi(par[0])): tmp2.append(tmp[i]) i = i + 1 if (par[1] == "L"): tmp = value.split(" ") tmp.reverse() tmp2 = [] i = 0 while (i < string.atoi(par[0])): tmp2.append(tmp[i]) i = i + 1 tmp2.reverse() out = string.join(tmp2, " ") elif (fn == "MINL"): - - par = set_par_defaults(par, "1") + + par = set_par_defaults(par, "1") tmp = value.split(" ") tmp2 = [] i = 0 for wrd in tmp: if (len(wrd) >= string.atoi(par[0])): tmp2.append(wrd) out = string.join(tmp2, " ") elif (fn == "MINLW"): - par = set_par_defaults(par, "1") + par = set_par_defaults(par, "1") if (len(value) >= string.atoi(par[0])): out = value else: out = "" elif (fn == "MAXL"): - par = set_par_defaults(par, "4096") + par = set_par_defaults(par, "4096") tmp = value.split(" ") tmp2 = [] i = 0 for wrd in tmp: if (len(wrd) <= string.atoi(par[0])): tmp2.append(wrd) out = string.join(tmp2, " ") - + elif (fn == "REP"): set_par_defaults(par, ",") if (par[0]!= ""): if (par[0][0:NRE] == regexp and par[0][-NRE:] == regexp): par[0] = par[0][NRE:-NRE] out = re.sub(par[0], value) else: out = value.replace(par[0], par[1]) elif (fn == "SHAPE"): - + if (value != ""): out = value.strip() elif (fn == "UP"): out = value.upper() elif (fn == "DOWN"): out = value.lower() elif (fn == "CAP"): tmp = value.split(" ") out2 = [] for wrd in tmp: wrd2 = wrd.capitalize() out2.append(wrd2) out = string.join(out2," ") elif (fn == "IF"): par = set_par_defaults(par,",,") N = 0 while N < 3: if (par[N][0:NRE] == regexp and par[N][-NRE:] == regexp): par[N] = par[N][NRE:-NRE] par[N] = re.search(par[N], value).group() N += 1 if (value == par[0]): out = par[1] else: out = par[2] if (out == "ORIG"): out = value elif (fn == "EXP"): par = set_par_defaults(par, ",0") if (par[0][0:NRE] == regexp and par[0][-NRE:] == regexp): par[0] = par[0][NRE:-NRE] par[0] = re.search(par[0], value).group() - + tmp = value.split(" ") out2 = [] for wrd in tmp: if (par[0][0:NRE] == regexp and par[0][-NRE:] == regexp): par[0] = par[0][NRE:-NRE] if ((re.search(par[0], wrd).group() == wrd) and \ (par[1] == "1")): out2.append(wrd) if ((re.search(par[0], wrd).group() != wrd) and \ (par[1] == "0")): out2.append(wrd) else: if ((len(wrd.split(par[0])) == 1) and \ (par[1] == "1")): out2.append(wrd) if ((len(wrd.split(par[0])) != 1) and \ (par[1] == "0")): - out2.append(wrd) + out2.append(wrd) out = string.join(out2," ") elif (fn == "EXPW"): par = set_par_defaults(par,",0") tmp = 
value.split(" ") out2 = [] for wrd in tmp: if ((format_field(wrd,"SUP(" + par[0] + ")") == wrd) and \ (par[1] == "1")): out2.append(wrd) if ((format_field(wrd,"SUP(" + par[0] + ")") != wrd) and \ (par[1] == "0")): out2.append(wrd) - + out = string.join(out2," ") - + elif (fn == "SPLIT"): par = set_par_defaults(par, "%d,0,,1" % conv_setting[1]) length = string.atoi(par[0]) + (string.atoi(par[1])) header = string.atoi(par[1]) headerplus = par[2] starting = string.atoi(par[3]) line = "" tmp2 = [] tmp3 = [] tmp = value.split(" ") linenumber = 1 if (linenumber >= starting): tmp2.append(headerplus) line = line + headerplus - + for wrd in tmp: line = line + " " + wrd tmp2.append(wrd) if (len(line) > length): linenumber = linenumber + 1 line = tmp2.pop() toout = string.join(tmp2) tmp3.append(toout) tmp2 = [] line2 = value[:header] if (linenumber >= starting): line3 = line2 + headerplus + line else: line3 = line2 + line - line = line3 - tmp2.append(line) + line = line3 + tmp2.append(line) tmp3.append(line) out = string.join(tmp3, "\n") out = format_field(out, "SHAPE()") elif (fn == "SPLITW"): par = set_par_defaults(par, ",0,,1") if (par[0][0:NRE] == regexp and par[0][-NRE:] == regexp): par[0] = par[0][NRE:-NRE] str = re.search(par[0], value) header = string.atoi(par[1]) headerplus = par[2] starting = string.atoi(par[3]) counter = 1 - + tmp2 = [] tmp = re.split(par[0], value) last = tmp.pop() - + for wrd in tmp: counter = counter + 1 if (counter >= starting): tmp2.append(value[:header] + headerplus + wrd + str) else: tmp2.append(value[:header] + wrd + str) if (last != ""): counter = counter + 1 if (counter >= starting): tmp2.append(value[:header] + headerplus + last) else: tmp2.append(value[:header] + last) - + out = string.join(tmp2, "\n") elif (fn == "CONF"): par = set_par_defaults(par, ",,1") found = 0 par1 = "" data = select_line(par[0], data_parsed) - + for line in data: if (par[1][0:NRE] == regexp and par[1][-NRE:] == regexp): par1 = par[1][NRE:-NRE] else: par1 = par[1] if (par1 == ""): if (line == ""): found = 1 elif (len(re.split(par1,line)) > 1 ): found = 1 if ((found == 1) and (string.atoi(par[2]) == 1)): out = value if ((found == 1) and (string.atoi(par[2]) == 0)): out = "" if ((found == 0) and (string.atoi(par[2]) == 1)): out = "" if ((found == 0) and (string.atoi(par[2]) == 0)): out = value - + return out - + elif (fn == "CONFL"): set_par_defaults(par,",1") if (par[0][0:NRE] == regexp and par[0][-NRE:] == regexp): par[0] = par[0][NRE:-NRE] if (re.search(par[0], value)): - if (string.atoi(par[1]) == 1): + if (string.atoi(par[1]) == 1): out = value else: out = "" else: - if (string.atoi(par[1]) == 1): + if (string.atoi(par[1]) == 1): out = "" else: out = value return out elif (fn == "CUT"): par = set_par_defaults(par, ",") left = value[:len(par[0])] right = value[-(len(par[1])):] if (left == par[0]): out = out[len(par[0]):] if (right == par[1]): out = out[:-(len(par[1]))] - + return out elif (fn == "NUM"): tmp = re.findall('\d', value) out = string.join(tmp, "") return out ## Match records with the database content ## def match_in_database(record, query_string): "Check if record is in alreadey in database with an oai identifier. Returns recID if present, 0 otherwise." 
query_string_parsed = parse_query_string(query_string) search_pattern = [] search_field = [] for query_field in query_string_parsed: ind1 = query_field[0][3:4] if ind1 == "_": ind1 = "" ind2 = query_field[0][4:5] if ind2 == "_": ind2 = "" stringsplit = "" % (query_field[0][0:3], ind1, ind2, query_field[0][5:6]) formatting = query_field[1:] record1 = string.split(record, stringsplit) - + if len(record1) > 1: - + matching_value = string.split(record1[1], "<")[0] for fn in formatting: matching_value = FormatField(matching_value, fn) search_pattern.append(matching_value) search_field.append(query_field[0]) search_field.append("") search_field.append("") search_field.append("") search_pattern.append("") search_pattern.append("") search_pattern.append("") recID_list = perform_request_search(p1=search_pattern[0], f1=search_field[0], p2=search_pattern[1], f2=search_field[1], p3=search_pattern[2], f3=search_field[2]) return recID_list def parse_query_string(query_string): """Parse query string, e.g.: Input: 245__a::REP(-, )::SHAPE::SUP(SPACE, )::MINL(4)::MAXL(8)::EXPW(PUNCT)::WORDS(4,L)::SHAPE::SUP(SPACE, )||700__a::MINL(2)::REP(COMMA,). Output:[['245__a','REP(-,)','SHAPE','SUP(SPACE, )','MINL(4)','MAXL(8)','EXPW(PUNCT)','WORDS(4,L)','SHAPE','SUP(SPACE, )'],['700__a','MINL(2)','REP(COMMA,)']] """ query_string_out = [] query_string_out_in = [] query_string_split_1 = query_string.split('||') for item_1 in query_string_split_1: query_string_split_2 = item_1.split('::') query_string_out_in = [] for item in query_string_split_2: query_string_out_in.append(item) query_string_out.append(query_string_out_in) return query_string_out def exit_on_error(error_message): "exit when error occured" sys.stderr.write("\n bibconvert data convertor\n") sys.stderr.write(" Error: %s\n" % error_message) sys.exit() return 0 def create_record(begin_record_header, ending_record_footer, query_string, match_mode, Xcount): "Create output record" global data_parsed out_to_print = "" out = [] field_data_item_LIST = [] ssn5cnt = "%3d" % Xcount sysno = generate("DATE(%w%H%M%S)") sysno500 = generate("XDATE(%w%H%M%S)," + ssn5cnt) - + for T_tpl_item_LIST in target_tpl_parsed: # the line is printed only if the variables inside are not empty print_line = 0 to_output = [] - rows = 1 + rows = 1 for field_tpl_item_STRING in T_tpl_item_LIST[1]: save_field_newlines = 0 DATA = [] if (field_tpl_item_STRING[:2]=="<:"): field_tpl_item_STRING = field_tpl_item_STRING[2:-2] field = field_tpl_item_STRING.split("::")[0] if (len(field_tpl_item_STRING.split("::")) == 1): value = generate(field) to_output.append([value]) else: subfield = field_tpl_item_STRING.split("::")[1] if (field[-1] == "*"): repetitive = 1 field = field[:-1] elif field[-1] == "^": ## Keep the newlines in a field's value: repetitive = 0 save_field_newlines = 1 field = field[:-1] else: repetitive = 0 if dirmode: DATA = select_line(field, data_parsed) else: DATA = select_line(field, data_parsed) if save_field_newlines == 1: ## put newlines back into the element value: DATA = [string.join(DATA, "\n")] elif (repetitive == 0): DATA = [string.join(DATA, " ")] SRC_TPL = select_line(field, source_tpl_parsed) try: ## Get the components that this field is composed of: field_components = field_tpl_item_STRING.split("::") num_field_components = len(field_components) ## Test the number of components. If it is greater that 2, ## some kind of functions must be called on the value of ## the field, and it should therefore be evaluated. If however, ## the field is made-up of only 2 components, (i.e. 
no functions ## are called on its value, AND the value is empty, do not bother ## to evaluate it. ## ## E.g. In the following line: ## 300---<:Num::Num:><:Num::Num::IF(,mult. p):> ## ## If we have a value "3" for page number (Num), we want the following result: ## 3 p ## If however, we have no value for page number (Num), we want this result: ## mult. p ## The functions relating to the datafield must therefore be executed ## ## If however, the template contains this line: ## 300---<:Num::Num:> ## ## If we have a value "3" for page number (Num), we want the following result: ## 3 ## If however, we have no value for page number (Num), we do NOT want the line ## to be printed at all - we should SKIP the element and not return an empty ## value ( would be pointless.) if (DATA[0] != "" or num_field_components > 2): DATA = get_subfields(DATA, subfield, SRC_TPL) FF = field_tpl_item_STRING.split("::") if (len(FF) > 2): FF = FF[2:] for fn in FF: # DATAFORMATTED = [] if (len(DATA) != 0): DATA = get_subfields(DATA, subfield, SRC_TPL) FF = field_tpl_item_STRING.split("::") if (len(FF) > 2): FF = FF[2:] for fn2 in FF: DATAFORMATTED = [] for item in DATA: item = FormatField(item, fn) if item != "": DATAFORMATTED.append(item) DATA = DATAFORMATTED if (len(DATA) > rows): rows = len(DATA) if DATA[0] != "": print_line = 1 to_output.append(DATA) except IndexError, e: pass else: to_output.append([field_tpl_item_STRING]) current = 0 default_print = 0 while (current < rows): line_to_print = [] for item in to_output: if (item == []): item = [''] if (len(item) <= current): printout = item[0] else: printout = item[current] line_to_print.append(printout) output = exp_n(string.join(line_to_print,"")) global_formatting_functions = T_tpl_item_LIST[0].split("::")[1:] for GFF in global_formatting_functions: if (GFF[:5] == "RANGE"): parR = get_pars(GFF)[1] parR = set_par_defaults(parR,"MIN,MAX") if (parR[0]!="MIN"): if (string.atoi(parR[0]) > (current+1)): output = "" if (parR[1]!="MAX"): if (string.atoi(parR[1]) < (current+1)): output = "" elif (GFF[:6] == "IFDEFP"): ## Like a DEFP and a CONF combined. I.e. Print the line ## EVEN if its a constant, but ONLY IF the condition in ## the IFDEFP is met. ## If the value returned is an empty string, no line will ## be printed. 
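                    ## Illustrative examples (the tag and value used here are
                    ## hypothetical): "::RANGE(1,2)" keeps only the first two
                    ## generated lines of a repetitive field, "::DEFP()" forces
                    ## printing of constant lines, and "::IFDEFP(980__a,//THESIS//,1)"
                    ## prints the line only when the named source field matches the
                    ## regular expression THESIS.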
output = FormatField(output, GFF) print_line = 1 elif (GFF[:4] == "DEFP"): default_print = 1 else: output = FormatField(output, GFF) if ((len(output) > set_conv()[0] and print_line == 1) or default_print): out_to_print = out_to_print + output + "\n" current = current + 1 ### out_flag = 0 if query_string: recID = match_in_database(out_to_print, query_string) - + if len(recID) == 1 and match_mode == 1: ctrlfield = "%d" % (recID[0]) out_to_print = ctrlfield + "\n" + out_to_print out_flag = 1 - + if len(recID) == 0 and match_mode == 0: out_flag = 1 - + if len(recID) > 1 and match_mode == 2: out_flag = 1 - - + + if out_flag or match_mode == -1: if begin_record_header != "": out_to_print = begin_record_header + "\n" + out_to_print if ending_record_footer != "": out_to_print = out_to_print + "\n" + ending_record_footer else: out_to_print = "" - + return out_to_print def convert(ar_): global dirmode, Xcount, conv_setting, sysno, sysno500, separator, tcounter, source_data, query_string, match_mode, begin_record_header, ending_record_footer, output_rec_sep, begin_header, ending_footer, oai_identifier_from, source_tpl, source_tpl_parsed, target_tpl, target_tpl_parsed, extract_tpl, extract_tpl_parsed, data_parsed dirmode, Xcount, conv_setting, sysno, sysno500, separator, tcounter, source_data, query_string, match_mode, begin_record_header, ending_record_footer, output_rec_sep, begin_header, ending_footer, oai_identifier_from, source_tpl, source_tpl_parsed, target_tpl, target_tpl_parsed, extract_tpl, extract_tpl_parsed = ar_ # separator = spt # Added by Alberto separator = sub_keywd(separator) - + if dirmode: if (os.path.isdir(source_data)): data_parsed = parse_input_data_d(source_data, source_tpl) - + record = create_record(begin_record_header, ending_record_footer, query_string, match_mode, Xcount) if record != "": print record tcounter = tcounter + 1 if output_rec_sep != "": print output_rec_sep else: exit_on_error("Cannot access directory: %s" % source_data) - + else: done = 0 print begin_header while (done == 0): data_parsed = parse_input_data_fx(source_tpl) if (data_parsed == -1): done = 1 else: if (data_parsed[0][0]!= ''): record = create_record(begin_record_header, ending_record_footer, query_string, match_mode, Xcount) Xcount += 1 if record != "": print record tcounter = tcounter + 1 if output_rec_sep != "": print output_rec_sep print ending_footer return diff --git a/modules/bibconvert/lib/bibconvert_bfx_engine.py b/modules/bibconvert/lib/bibconvert_bfx_engine.py index f2aa705d0..39d7d6f38 100644 --- a/modules/bibconvert/lib/bibconvert_bfx_engine.py +++ b/modules/bibconvert/lib/bibconvert_bfx_engine.py @@ -1,303 +1,303 @@ # -*- coding: utf-8 -*- ## ## $Id$ ## ## This file is part of CDS Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 CERN. ## ## CDS Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## CDS Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -## General Public License for more details. +## General Public License for more details. 
## ## You should have received a copy of the GNU General Public License ## along with CDS Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. """ bibconvert_bfx_engine - XML processing library for CDS Invenio using bfx stylesheets. Does almost what an XSLT processor does, but using a special syntax for the transformation stylesheet: a combination of 'BibFormat for XML' (bibformat bfx) templates and XPath is used. Dependencies: bibformat_bfx_engine.py Used by: bibconvert.in """ __revision__ = "$Id$" import sys import os from cStringIO import StringIO processor_type = -1 try: # Try to load from xml.xpath import Evaluate from xml.dom import minidom, Node from xml.xpath.Context import Context processor_type = 0 except ImportError: pass -# TODO: Try to explicitely load 4suite Xpath +# TODO: Try to explicitely load 4suite Xpath # # From : ## 1. PyXML usage (do not use with 4Suite) ## * import xml.xslt ## * import xml.xpath ## 2. 4Suite usage (use these imports) ## * import Ft.Xml.XPath ## * import Ft.Xml.Xslt - + from invenio import bibformat_bfx_engine -from invenio.config import etcdir +from invenio.config import CFG_ETCDIR -CFG_BFX_TEMPLATES_PATH = "%s%sbibconvert%sconfig" % (etcdir, os.sep, os.sep) +CFG_BFX_TEMPLATES_PATH = "%s%sbibconvert%sconfig" % (CFG_ETCDIR, os.sep, os.sep) def convert(xmltext, template_filename=None, template_source=None): """ Processes an XML text according to a template, and returns the result. The template can be given either by name (or by path) or by source. If source is given, name is ignored. bibconvert_bfx_engine will look for template_filename in standard directories for templates. If not found, template_filename will be assumed to be a path to a template. If none can be found, return None. Raises an exception if cannot find an appropriate XPath module. @param xmltext The string representation of the XML to process @param template_filename The name of the template to use for the processing @param template_source The configuration describing the processing. @return the transformed XML text. """ if processor_type == -1: # No XPath processor found raise "No XPath processor could be found" - + # Retrieve template and read it if template_source: template = template_source elif template_filename: try: path_to_templates = (CFG_BFX_TEMPLATES_PATH + os.sep + template_filename) if os.path.exists(path_to_templates): template = file(path_to_templates).read() elif os.path.exists(template_filename): template = file(template_filename).read() else: sys.stderr.write(template_filename +' does not exist.') return None except IOError: sys.stderr.write(template_filename +' could not be read.') return None else: sys.stderr.write(template_filename +' was not given.') return None # Prepare some variables - out_file = StringIO() # Virtual file-like object to write result in + out_file = StringIO() # Virtual file-like object to write result in trans = XML2XMLTranslator() trans.set_xml_source(xmltext) parser = bibformat_bfx_engine.BFXParser(trans) - + # Load template # This might print some info. 
Redirect to stderr # but do no print on standard output standard_output = sys.stdout sys.stdout = sys.stderr # Always set 'template_name' to None, otherwise # bibformat for XML will look for it in wrong directory template_tree = parser.load_template(template_name=None, template_source=template) sys.stdout = standard_output # Transform the source using loaded template parser.walk(template_tree, out_file) - output = out_file.getvalue() + output = out_file.getvalue() return output class XML2XMLTranslator: """ Generic translator for XML. """ def __init__(self): ''' Create an instance of the translator and init with the list of the defined labels and their rules. ''' self.xml_source = '' self.dom = None self.current_node = None self.namespaces = {} def is_defined(self, name): ''' Check whether a variable is defined. Accept all names. get_value will return empty string if not exist - + @param name the name of the variable ''' return True ## context = Context(self.current_node, processorNss=self.namespaces) - + ## results_list = Evaluate(name, context=context) ## if results_list != []: ## return True ## else: ## return False - + def get_num_elements(self, name): ''' An API function to get the number of elements for a variable. Do not use this function to build loops, Use iterator instead. ''' context = Context(self.current_node, processorNss=self.namespaces) results_list = Evaluate(name, context=context) return len(results_list) def get_value(self, name, display_type='value'): ''' The API function for quering the translator for values of a certain variable. Called in a loop will result in a different value each time. - + @param name the name of the variable you want the value of @param display_type an optional value for the type of the desired output, one of: value, tag, ind1, ind2, code, fulltag; These can be easily added in the proper place of the code (display_value) ''' context = Context(self.current_node, processorNss=self.namespaces) results_list = Evaluate(name, context=context) if len(results_list) == 0: return '' # Select text node value of selected nodes # and concatenate return ' '.join([node.childNodes[0].nodeValue.encode( "utf-8" ) for node in results_list]) - + def iterator(self, name): ''' An iterator over the values of a certain name. The iterator changes state of interenal variables and objects. When calling get_value in a loop, this will result each time in a different value. 
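        Illustrative usage (a sketch; 'translator' stands for an
        XML2XMLTranslator whose XML source has been set, and the XPath is
        assumed to select element nodes carrying text content):

            for node in translator.iterator('//subfield'):
                text = translator.get_value('.')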
''' saved_node = self.current_node context = Context(self.current_node, processorNss=self.namespaces) results_list = Evaluate(name, context=context) for node in results_list: self.current_node = node yield node self.current_node = saved_node - + def call_function(self, function_name, parameters=None): ''' Call an external element which is a Python file, using BibFormat @param function_name the name of the function to call @param parameters a dictionary of the parameters to pass as key=value pairs @return a string value, which is the result of the function call ''' #No support for this in bibconvert_bfx_engine ## if parameters is None: ## parameters = {} ## bfo = BibFormatObject(self.recID) ## format_element = get_format_element(function_name) ## (value, errors) = eval_format_element(format_element, bfo, parameters) ## #to do: check errors from function call ## return value return "" - + def set_xml_source(self, xmltext): """ Specify the source XML for this transformer @param xmltext the XML text representation to use as source """ self.xml_source = xmltext self.dom = minidom.parseString(xmltext) self.current_node = self.dom self.namespaces = build_namespaces(self.dom) def doc_order_iter_filter(node, filter_func): """ Iterates over each node in document order, applying the filter function to each in turn, starting with the given node, and yielding each node in cases where the filter function computes true @param node the starting point (subtree rooted at node will be iterated over document order) @param filter_func a callable object taking a node and returning true or false """ if filter_func(node): yield node for child in node.childNodes: for cn in doc_order_iter_filter(child, filter_func): yield cn return def get_all_elements(node): """ Returns an iterator (using document order) over all element nodes that are descendants of the given one """ return doc_order_iter_filter( node, lambda n: n.nodeType == Node.ELEMENT_NODE ) def build_namespaces(dom): """ Build the namespaces present in dom tree. - + Necessary to use prior processing an XML file in order to execute XPath queries correctly. @param dom the dom tree to parse to discover namespaces @return a dictionary with prefix as key and namespace as value """ namespaces = {} for elem in get_all_elements(dom): if elem.prefix is not None: namespaces[elem.prefix] = elem.namespaceURI - + for attr in elem.attributes.values(): if attr.prefix is not None: namespaces[attr.prefix] = attr.namespaceURI return namespaces def bc_profile(): """ Runs a benchmark """ global xmltext - + convert(xmltext, 'oaidc2marcxml.bfx') return def benchmark(): """ Benchmark the module, using profile and pstats """ import profile import pstats from invenio.bibformat import record_get_xml global xmltext - + xmltext = record_get_xml(10, 'oai_dc') profile.run('bc_profile()', "bibconvert_xslt_profile") p = pstats.Stats("bibconvert_xslt_profile") p.strip_dirs().sort_stats("cumulative").print_stats() if __name__ == "__main__": # FIXME: Implement command line options pass diff --git a/modules/bibconvert/lib/bibconvert_xslt_engine.py b/modules/bibconvert/lib/bibconvert_xslt_engine.py index 56c645f83..d38c8b9e7 100644 --- a/modules/bibconvert/lib/bibconvert_xslt_engine.py +++ b/modules/bibconvert/lib/bibconvert_xslt_engine.py @@ -1,263 +1,263 @@ # -*- coding: utf-8 -*- ## ## $Id$ ## ## This file is part of CDS Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 CERN. 
## ## CDS Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## CDS Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -## General Public License for more details. +## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with CDS Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. """ bibconvert_xslt_engine - Wrapper for an XSLT engine. Customized to support BibConvert functions through the use of XPath 'format' function. Dependencies: Need one of the following XSLT processors: - libxml2 & libxslt - 4suite Used by: bibconvert.in FIXME: - Find better namespace for functions - Find less bogus URI (given as param to processor) for source and template - Implement command-line options - Think about better handling of 'value' parameter in bibconvert_function_* """ __revision__ = "$Id$" import sys import os from invenio.config import \ - etcdir, \ + CFG_ETCDIR, \ weburl from invenio.bibconvert import FormatField # The namespace used for BibConvert functions CFG_BIBCONVERT_FUNCTION_NS = "http://cdsweb.cern.ch/bibconvert/fn" # Import one XSLT processor # # processor_type: # -1 : No processor found # 0 : libxslt # 1 : 4suite processor_type = -1 try: # libxml2 & libxslt import libxml2 import libxslt processor_type = 0 except ImportError: pass if processor_type == -1: try: # 4suite from Ft.Xml.Xslt import Processor from Ft.Xml import InputSource from xml.dom import Node processor_type = 1 except ImportError: pass -CFG_BIBCONVERT_XSL_PATH = "%s%sbibconvert%sconfig" % (etcdir, os.sep, os.sep) +CFG_BIBCONVERT_XSL_PATH = "%s%sbibconvert%sconfig" % (CFG_ETCDIR, os.sep, os.sep) def bibconvert_function_libxslt(ctx, value, func): """ libxslt extension function: Bridge between BibConvert formatting functions and XSL stylesheets. Can be used in that way in XSL stylesheet (provided xmlns:fn="http://cdsweb.cern.ch/bibconvert/fn" has been declared): (Adds strings 'mypref' and 'mysuff' as prefix/suffix to current node value, using BibConvert ADD function) - + if value is int, value is converted to string if value is Node (PyCObj), first child node (text node) is taken as value """ try: if isinstance(value, str): string_value = value elif isinstance(value, int): string_value = str(value) else: string_value = libxml2.xmlNode(_obj=value[0]).children.content return FormatField(string_value, func).rstrip('\n') except Exception, err: sys.stderr.write("Error during formatting function evaluation: " + \ str(err) + \ '\n') - + return '' def bibconvert_function_4suite(ctx, value, func): """ 4suite extension function: Bridge between BibConvert formatting functions and XSL stylesheets. 
Can be used in that way in XSL stylesheet (provided xmlns:fn="http://cdsweb.cern.ch/bibconvert/fn" has been declared): (Adds strings 'mypref' and 'mysuff' as prefix/suffix to current node value, using BibConvert ADD function) if value is int, value is converted to string if value is Node, first child node (text node) is taken as value """ try: if len(value) > 0 and isinstance(value[0], Node): string_value = value[0].firstChild.nodeValue if string_value is None: string_value = '' else: string_value = str(value) return FormatField(string_value, func).rstrip('\n') - + except Exception, err: sys.stderr.write("Error during formatting function evaluation: " + \ str(err) + \ '\n') - + return '' def convert(xmltext, template_filename=None, template_source=None): """ Processes an XML text according to a template, and returns the result. The template can be given either by name (or by path) or by source. If source is given, name is ignored. bibconvert_xslt_engine will look for template_filename in standard directories for templates. If not found, template_filename will be assumed to be a path to a template. If none can be found, return None. Raises an exception if cannot find an appropriate XSLT processor. @param xmltext The string representation of the XML to process @param template_filename The name of the template to use for the processing @param template_source The configuration describing the processing. @return the transformed XML text. """ if processor_type == -1: # No XSLT processor found raise "No XSLT processor could be found" - + # Retrieve template and read it if template_source: template = template_source elif template_filename: try: path_to_templates = (CFG_BIBCONVERT_XSL_PATH + os.sep + template_filename) if os.path.exists(path_to_templates): template = file(path_to_templates).read() elif os.path.exists(template_filename): template = file(template_filename).read() else: sys.stderr.write(template_filename +' does not exist.') return None except IOError: sys.stderr.write(template_filename +' could not be read.') return None else: sys.stderr.write(template_filename +' was not given.') return None result = "" if processor_type == 0: # libxml2 & libxslt - + # Register BibConvert functions for use in XSL libxslt.registerExtModuleFunction("format", CFG_BIBCONVERT_FUNCTION_NS, bibconvert_function_libxslt) # Load template and source template_xml = libxml2.parseDoc(template) processor = libxslt.parseStylesheetDoc(template_xml) source = libxml2.parseDoc(xmltext) # Transform result_object = processor.applyStylesheet(source, None) result = processor.saveResultToString(result_object) # Deallocate processor.freeStylesheet() source.freeDoc() result_object.freeDoc() elif processor_type == 1: # 4suite # Init processor = Processor.Processor() - + # Register BibConvert functions for use in XSL processor.registerExtensionFunction(CFG_BIBCONVERT_FUNCTION_NS, "format", bibconvert_function_4suite) # Load template and source transform = InputSource.DefaultFactory.fromString(template, uri=weburl) source = InputSource.DefaultFactory.fromString(xmltext, uri=weburl) processor.appendStylesheet(transform) # Transform result = processor.run(source) else: sys.stderr.write("No XSLT processor could be found") - + return result ## def bc_profile(): ## """ ## Runs a benchmark ## """ ## global xmltext ## convert(xmltext, 'oaidc2marcxml.xsl') ## return ## def benchmark(): ## """ ## Benchmark the module, using profile and pstats ## """ ## import profile ## import pstats ## from invenio.bibformat import record_get_xml ## 
global xmltext - + ## xmltext = record_get_xml(10, 'oai_dc') ## profile.run('bc_profile()', "bibconvert_xslt_profile") ## p = pstats.Stats("bibconvert_xslt_profile") ## p.strip_dirs().sort_stats("cumulative").print_stats() - + if __name__ == "__main__": pass diff --git a/modules/bibedit/lib/bibedit_engine.py b/modules/bibedit/lib/bibedit_engine.py index 2271c4fd6..82c949d3a 100644 --- a/modules/bibedit/lib/bibedit_engine.py +++ b/modules/bibedit/lib/bibedit_engine.py @@ -1,390 +1,390 @@ ## $Id$ ## ## This file is part of CDS Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 CERN. ## ## CDS Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## CDS Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with CDS Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. __revision__ = "$Id$" import os import time import cPickle -from invenio.config import bindir, tmpdir +from invenio.config import CFG_BINDIR, CFG_TMPDIR from invenio.bibedit_dblayer import marc_to_split_tag from invenio.bibedit_config import * from invenio.search_engine import print_record, record_exists from invenio.bibrecord import record_xml_output, create_record, field_add_subfield, record_add_field import invenio.template bibedit_templates = invenio.template.load('bibedit') def perform_request_index(ln, recid, cancel, delete, confirm_delete, uid, temp, format_tag, edit_tag, delete_tag, num_field, add, dict_value=None): """Returns the body of main page. 
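    Depending on the request parameters it cancels a pending edit session,
    marks a record as deleted (adding a 980 $c DELETED field), applies tag
    edits, additions and deletions, and renders the record table.  Returns a
    (body, errors, warnings) tuple.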
""" errors = [] warnings = [] body = '' if cancel != 0: os.system("rm %s.tmp" % get_file_path(cancel)) if delete != 0: if confirm_delete != 0: body = bibedit_templates.tmpl_deleted(ln, 1, delete, temp, format_tag) else: (record, junk) = get_record(ln, delete, uid, "false") add_field(delete, uid, record, "980", "", "", "c", "DELETED") save_temp_record(record, uid, "%s.tmp" % get_file_path(delete)) save_xml_record(delete) body = bibedit_templates.tmpl_deleted(ln) else: if recid != 0 : if record_exists(recid) > 0: (record, body) = get_record(ln, recid, uid, temp) if record != '': if add == 3: body = '' if edit_tag is not None and dict_value is not None: record = edit_record(recid, uid, record, edit_tag, dict_value, num_field) if delete_tag is not None and num_field is not None: record = delete_field(recid, uid, record, delete_tag, num_field) if add == 4: tag = dict_value.get("add_tag" , '') ind1 = dict_value.get("add_ind1" , '') ind2 = dict_value.get("add_ind2" , '') subcode = dict_value.get("add_subcode", '') value = dict_value.get("add_value" , '') if tag != '' and subcode != '' and value != '': record = add_field(recid, uid, record, tag, ind1, ind2, subcode, value) body += bibedit_templates.tmpl_table_header(ln, "record", recid, temp, format_tag, add=add) keys = record.keys() keys.sort() for tag in keys: fields = record.get(str(tag), "empty") if fields != "empty": for field in fields: if field[0]: # Only display if has subfield(s) body += bibedit_templates.tmpl_table_value(ln, recid, tag, field, format_tag, "record", add) if add == 3: body += bibedit_templates.tmpl_table_value(ln, recid, '', [], format_tag, "record", add, 1) body += bibedit_templates.tmpl_table_footer(ln, "record", add) else: body = bibedit_templates.tmpl_record_choice_box(ln, 2) else: if record_exists(recid) == -1: body = bibedit_templates.tmpl_record_choice_box(ln, 3) else: body = bibedit_templates.tmpl_record_choice_box(ln, 1) else: body = bibedit_templates.tmpl_record_choice_box(ln, 0) return (body, errors, warnings) def perform_request_edit(ln, recid, uid, tag, num_field, num_subfield, format_tag, temp, del_subfield, add, dict_value): """Returns the body of edit page. """ errors = [] warnings = [] body = '' if record_exists(recid) in (-1, 0): body = bibedit_templates.tmpl_record_choice_box(ln, 0) return (body, errors, warnings) (record, junk) = get_record(ln, recid, uid, temp) if del_subfield is not None: record = delete_subfield(recid, uid, record, tag, num_field, num_subfield) if add == 2: subcode = dict_value.get("add_subcode", "empty") value = dict_value.get("add_value" , "empty") if subcode == '': subcode = "empty" if value == '': value = "empty" if value != "empty" and subcode != "empty": record = add_subfield(recid, uid, tag, record, num_field, subcode, value) body += bibedit_templates.tmpl_table_header(ln, "edit", recid, temp=temp, tag=tag, num_field=num_field, add=add) tag = tag[:3] fields = record.get(str(tag), "empty") if fields != "empty": for field in fields: if field[4] == int(num_field) : body += bibedit_templates.tmpl_table_value(ln, recid, tag, field, format_tag, "edit", add) break body += bibedit_templates.tmpl_table_footer(ln, "edit", add) return (body, errors, warnings) def perform_request_submit(ln, recid): """Submits record to the database. """ save_xml_record(recid) errors = [] warnings = [] body = bibedit_templates.tmpl_submit(ln) return (body, errors, warnings) def get_file_path(recid): """ return the file path of record. 
""" - return "%s/%s_%s" % (tmpdir, CFG_BIBEDIT_TMPFILENAMEPREFIX, str(recid)) + return "%s/%s_%s" % (CFG_TMPDIR, CFG_BIBEDIT_TMPFILENAMEPREFIX, str(recid)) def save_xml_record(recid): """Saves XML record file to database. """ file_path = get_file_path(recid) file_temp = open("%s.xml" % file_path, 'w') file_temp.write(record_xml_output(get_temp_record("%s.tmp" % file_path)[1])) file_temp.close() - os.system("%s/bibupload -u bibedit -r %s.xml" % (bindir, file_path)) + os.system("%s/bibupload -u bibedit -r %s.xml" % (CFG_BINDIR, file_path)) os.system("rm %s.tmp" % file_path) def save_temp_record(record, uid, file_path): """ Save record dict in temp file. """ file_temp = open(file_path, "w") cPickle.dump([uid, record], file_temp) file_temp.close() def get_temp_record(file_path): """Loads record dict from a temp file. """ file_temp = open(file_path) [uid, record] = cPickle.load(file_temp) file_temp.close() return (uid, record) def get_record(ln, recid, uid, temp): """Returns a record dict, and warning message in case of error. """ file_path = get_file_path(recid) if temp != "false": warning_temp_file = bibedit_templates.tmpl_warning_temp_file(ln) else: warning_temp_file = '' if os.path.isfile("%s.tmp" % file_path): (uid_record_temp, record) = get_temp_record("%s.tmp" % file_path) if uid_record_temp != uid: time_tmp_file = os.path.getmtime("%s.tmp" % file_path) time_out_file = int(time.time()) - CFG_BIBEDIT_TIMEOUT if time_tmp_file < time_out_file : os.system("rm %s.tmp" % file_path) record = create_record(print_record(recid, 'xm'))[0] save_temp_record(record, uid, "%s.tmp" % file_path) else: record = '' else: record = create_record(print_record(recid, 'xm'))[0] save_temp_record(record, uid, "%s.tmp" % file_path) return (record, warning_temp_file) ######### EDIT ######### def edit_record(recid, uid, record, edit_tag, dict_value, num_field): """Edits value of a record. """ for num_subfield in range( len(dict_value.keys())/3 ): # Iterate over subfield indices of field new_subcode = dict_value.get("subcode%s" % num_subfield, None) old_subcode = dict_value.get("old_subcode%s" % num_subfield, None) new_value = dict_value.get("value%s" % num_subfield, None) old_value = dict_value.get("old_value%s" % num_subfield, None) if new_value is not None and old_value is not None \ and new_subcode is not None and old_subcode is not None: # Make sure we actually get these values if new_value != '' and new_subcode != '': # Forbid empty values if new_value != old_value or \ new_subcode != old_subcode: # only change when necessary edit_tag = edit_tag[:5] record = edit_subfield(record, edit_tag, new_subcode, new_value, num_field, num_subfield) save_temp_record(record, uid, "%s.tmp" % get_file_path(recid)) return record def edit_subfield(record, tag, new_subcode, new_value, num_field, num_subfield): """Edits the value of a subfield. """ new_value = bibedit_templates.tmpl_clean_value(str(new_value), "html") (tag, ind1, ind2, junk) = marc_to_split_tag(tag) fields = record.get(str(tag), None) if fields is not None: i = -1 for field in fields: i += 1 if field[4] == int(num_field): subfields = field[0] j = -1 for subfield in subfields: j += 1 if j == num_subfield: # Rely on counted index to identify subfield to edit... record[tag][i][0][j] = (new_subcode, new_value) break break return record ######### ADD ######## def add_field(recid, uid, record, tag, ind1, ind2, subcode, value_subfield): """Adds a new field to the record. 
""" tag = tag[:3] new_field_number = record_add_field(record, tag, ind1, ind2) record = add_subfield(recid, uid, tag, record, new_field_number, subcode, value_subfield) save_temp_record(record, uid, "%s.tmp" % get_file_path(recid)) return record def add_subfield(recid, uid, tag, record, num_field, subcode, value): """Adds a new subfield to a field. """ tag = tag[:3] fields = record.get(str(tag)) i = -1 for field in fields: i += 1 if field[4] == int(num_field) : subfields = field[0] same_subfield = False for subfield in subfields: if subfield[0] == subcode: same_subfield = True if not same_subfield: field_add_subfield(record[tag][i], subcode, value) break save_temp_record(record, uid, "%s.tmp" % get_file_path(recid)) return record ######### DELETE ######## def delete_field(recid, uid, record, tag, num_field): """Deletes field in record. """ (tag, junk, junk, junk) = marc_to_split_tag(tag) tmp = [] for field in record[tag]: if field[4] != int(num_field) : tmp.append(field) if tmp != []: record[tag] = tmp else: del record[tag] save_temp_record(record, uid, "%s.tmp" % get_file_path(recid)) return record def delete_subfield(recid, uid, record, tag, num_field, num_subfield): """Deletes subfield of a field. """ (tag, junk, junk, subcode) = marc_to_split_tag(tag) tmp = [] i = -1 deleted = False for field in record[tag]: i += 1 if field[4] == int(num_field): j = 0 for subfield in field[0]: if j != num_subfield: #if subfield[0] != subcode or deleted == True: tmp.append((subfield[0], subfield[1])) #else: # deleted = True j += 1 break record[tag][i] = (tmp, record[tag][i][1], record[tag][i][2], record[tag][i][3], record[tag][i][4]) save_temp_record(record, uid, "%s.tmp" % get_file_path(recid)) return record diff --git a/modules/bibedit/lib/bibrecord_config.py b/modules/bibedit/lib/bibrecord_config.py index 0c0d15dc4..5b2957f25 100644 --- a/modules/bibedit/lib/bibrecord_config.py +++ b/modules/bibedit/lib/bibrecord_config.py @@ -1,55 +1,55 @@ ## $Id$ ## This file is part of CDS Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 CERN. ## ## CDS Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## CDS Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -## General Public License for more details. +## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with CDS Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. 
### CONFIGURATION OPTIONS FOR BIBRECORD LIBRARY """bibrecord configuration""" __revision__ = "$Id$" -from invenio.config import etcdir +from invenio.config import CFG_ETCDIR # location of the MARC21 DTD file: -CFG_MARC21_DTD = "%s/bibedit/MARC21slim.dtd" % etcdir +CFG_MARC21_DTD = "%s/bibedit/MARC21slim.dtd" % CFG_ETCDIR # pylint: disable-msg=C0301 # internal dictionary of warning messages: CFG_BIBRECORD_WARNING_MSGS = { 0: '' , 1: 'WARNING: tag missing for field(s)\nValue stored with tag \'000\'', 2: 'WARNING: bad range for tags (tag must be in range 001-999)\nValue stored with tag \'000\'', 3: 'WARNING: Missing atributte \'code\' for subfield\nValue stored with code \'\'', 4: 'WARNING: Missing attributte \'ind1\'\n Value stored with ind1 = \'\'', 5: 'WARNING: Missing attributte \'ind2\'\n Value stored with ind2 = \'\'', 6: 'Import Error\n', 7: 'WARNING: value expected of type string.', 8: 'WARNING: empty datafield', 98:'WARNING: problems importing invenio', 99: 'Document not well formed' - } + } # verbose level to be used when creating records from XML: (0=least, ..., 9=most) CFG_BIBRECORD_DEFAULT_VERBOSE_LEVEL = 0 # correction level to be used when creating records from XML: (0=no, 1=yes) CFG_BIBRECORD_DEFAULT_CORRECT = 0 # XML parsers available: (0=minidom, 1=4suite, 2=PyRXP) -CFG_BIBRECORD_PARSERS_AVAILABLE = [0, 1, 2] +CFG_BIBRECORD_PARSERS_AVAILABLE = [0, 1, 2] diff --git a/modules/bibedit/lib/bibrecord_tests.py b/modules/bibedit/lib/bibrecord_tests.py index f01fa9866..f1b8632e1 100644 --- a/modules/bibedit/lib/bibrecord_tests.py +++ b/modules/bibedit/lib/bibrecord_tests.py @@ -1,822 +1,822 @@ # -*- coding: utf-8 -*- ## ## $Id$ ## ## This file is part of CDS Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 CERN. ## ## CDS Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## CDS Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with CDS Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. 
__revision__ = "$Id$" import unittest from string import expandtabs, replace -from invenio.config import tmpdir, etcdir +from invenio.config import CFG_TMPDIR, CFG_ETCDIR from invenio import bibrecord # pylint: disable-msg=C0301 class BibRecordSanityTest(unittest.TestCase): """ bibrecord - sanity test (xml -> create records -> xml)""" def test_for_sanity(self): """ bibrecord - demo file sanity test (xml -> create records -> xml)""" - f = open(tmpdir + '/demobibdata.xml', 'r') + f = open(CFG_TMPDIR + '/demobibdata.xml', 'r') xmltext = f.read() f.close() # let's try to reproduce the demo XML MARC file by parsing it and printing it back: recs = map((lambda x:x[0]), bibrecord.create_records(xmltext)) xmltext_reproduced = bibrecord.records_xml_output(recs) x = xmltext_reproduced y = xmltext # 'normalize' the two XML MARC files for the purpose of comparing x = expandtabs(x) y = expandtabs(y) x = x.replace(' ', '') y = y.replace(' ', '') - x = x.replace('\n' % etcdir, + x = x.replace('\n' % CFG_ETCDIR, '') x = x.replace('', "\n") x = x.replace('', "\n\n") x = x[1:100] y = y[1:100] self.assertEqual(x, y) class BibRecordSuccessTest(unittest.TestCase): """ bibrecord - demo file parsing test """ def setUp(self): # pylint: disable-msg=C0103 """Initialize stuff""" - f = open(tmpdir + '/demobibdata.xml', 'r') + f = open(CFG_TMPDIR + '/demobibdata.xml', 'r') xmltext = f.read() f.close() self.recs = map((lambda x: x[0]), bibrecord.create_records(xmltext)) def test_records_created(self): """ bibrecord - demo file how many records are created """ self.assertEqual(95, len(self.recs)) def test_tags_created(self): """ bibrecord - demo file which tags are created """ ## check if the tags are correct # tags = ['020', '037', '041', '080', '088', '100', '245', '246', '250', '260', '270', '300', '340', '490', '500', '502', '520', '590', '595', '650', '653', '690', '700', '710', '856', '909', '980', '999'] tags = [u'003', u'005', '020', '035', '037', '041', '080', '088', '100', '245', '246', '250', '260', '269', '270', '300', '340', '490', '500', '502', '520', '590', '595', '650', '653', '690', '695', '700', '710', '720', '856', '859', '901', '909', '916', '960', '961', '962', '963', '970', '980', '999', 'FFT'] t = [] for rec in self.recs: t.extend(rec.keys()) t.sort() #eliminate the elements repeated tt = [] for x in t: if not x in tt: tt.append(x) self.assertEqual(tags, tt) def test_fields_created(self): """bibrecord - demo file how many fields are created""" ## check if the number of fields for each record is correct fields = [14, 14, 8, 11, 11, 12, 11, 15, 10, 18, 14, 16, 10, 9, 15, 10, 11, 11, 11, 9, 10, 10, 10, 8, 8, 8, 9, 9, 9, 10, 8, 8, 8, 8, 14, 13, 14, 14, 15, 12, 12, 12, 15, 14, 12, 16, 16, 15, 15, 14, 16, 15, 15, 15, 16, 15, 16, 15, 15, 16, 15, 14, 14, 15, 12, 13, 11, 15, 8, 11, 14, 13, 12, 13, 6, 6, 25, 24, 27, 26, 26, 24, 26, 27, 25, 28, 24, 23, 27, 25, 25, 26, 26, 24, 19] cr = [] ret = [] for rec in self.recs: cr.append(len(rec.values())) ret.append(rec) self.assertEqual(fields, cr) class BibRecordBadInputTreatmentTest(unittest.TestCase): """ bibrecord - testing for bad input treatment """ def test_wrong_attribute(self): """bibrecord - bad input subfield \'cde\' instead of \'code\'""" ws = bibrecord.CFG_BIBRECORD_WARNING_MSGS xml_error1 = """ 33 eng Doe, John On the foo and bar """ (rec, st, e) = bibrecord.create_record(xml_error1, 1, 1) ee ='' for i in e: if type(i).__name__ == 'str': if i.count(ws[3])>0: ee = i self.assertEqual(bibrecord.warning((3, '(field number: 4)')), ee) def 
test_missing_attribute(self): """ bibrecord - bad input missing \"tag\" """ ws = bibrecord.CFG_BIBRECORD_WARNING_MSGS xml_error2 = """ 33 eng Doe, John On the foo and bar """ (rec, st, e) = bibrecord.create_record(xml_error2, 1, 1) ee = '' for i in e: if type(i).__name__ == 'str': if i.count(ws[1])>0: ee = i self.assertEqual(bibrecord.warning((1, '(field number(s): [2])')), ee) def test_empty_datafield(self): """ bibrecord - bad input no subfield """ ws = bibrecord.CFG_BIBRECORD_WARNING_MSGS xml_error3 = """ 33 Doe, John On the foo and bar """ (rec, st, e) = bibrecord.create_record(xml_error3, 1, 1) ee = '' for i in e: if type(i).__name__ == 'str': if i.count(ws[8])>0: ee = i self.assertEqual(bibrecord.warning((8, '(field number: 2)')), ee) def test_missing_tag(self): """bibrecord - bad input missing end \"tag\" """ ws = bibrecord.CFG_BIBRECORD_WARNING_MSGS xml_error4 = """ 33 eng Doe, John On the foo and bar """ (rec, st, e) = bibrecord.create_record(xml_error4, 1, 1) ee = '' for i in e: if type(i).__name__ == 'str': if i.count(ws[99])>0: ee = i self.assertEqual(bibrecord.warning((99, '(Tagname : datafield)')), ee) class BibRecordAccentedUnicodeLettersTest(unittest.TestCase): """ bibrecord - testing accented UTF-8 letters """ def setUp(self): # pylint: disable-msg=C0103 """Initialize stuff""" self.xml_example_record = """ 33 eng Döè1, John Doe2, J>ohn editor Пушкин On the foo and bar2 """ (self.rec, st, e) = bibrecord.create_record(self.xml_example_record, 1, 1) def test_accented_unicode_characters(self): """bibrecord - accented Unicode letters""" self.assertEqual(self.xml_example_record, bibrecord.record_xml_output(self.rec)) self.assertEqual(bibrecord.record_get_field_instances(self.rec, "100", " ", " "), [([('a', 'Döè1, John')], " ", " ", "", 3), ([('a', 'Doe2, J>ohn'), ('b', 'editor')], " ", " ", "", 4)]) self.assertEqual(bibrecord.record_get_field_instances(self.rec, "245", " ", "1"), [([('a', 'Пушкин')], " ", '1', "", 5)]) class BibRecordGettingFieldValuesTest(unittest.TestCase): """ bibrecord - testing for getting field/subfield values """ def setUp(self): # pylint: disable-msg=C0103 """Initialize stuff""" xml_example_record = """ 33 eng Doe1, John Doe2, John editor On the foo and bar1 On the foo and bar2 """ (self.rec, st, e) = bibrecord.create_record(xml_example_record, 1, 1) def test_get_field_instances(self): """bibrecord - getting field instances""" self.assertEqual(bibrecord.record_get_field_instances(self.rec, "100", " ", " "), [([('a', 'Doe1, John')], " ", " ", "", 3), ([('a', 'Doe2, John'), ('b', 'editor')], " ", " ", "", 4)]) self.assertEqual(bibrecord.record_get_field_instances(self.rec, "", " ", " "), [('245', [([('a', 'On the foo and bar1')], " ", '1', "", 5), ([('a', 'On the foo and bar2')], " ", '2', "", 6)]), ('001', [([], " ", " ", '33', 1)]), ('100', [([('a', 'Doe1, John')], " ", " ", "", 3), ([('a', 'Doe2, John'), ('b', 'editor')], " ", " ", "", 4)]), ('041', [([('a', 'eng')], " ", " ", "", 2)])]) def test_get_field_values(self): """bibrecord - getting field values""" self.assertEqual(bibrecord.record_get_field_values(self.rec, "100", " ", " ", "a"), ['Doe1, John', 'Doe2, John']) self.assertEqual(bibrecord.record_get_field_values(self.rec, "100", " ", " ", "b"), ['editor']) def test_get_field_value(self): """bibrecord - getting first field value""" self.assertEqual(bibrecord.record_get_field_value(self.rec, "100", " ", " ", "a"), 'Doe1, John') self.assertEqual(bibrecord.record_get_field_value(self.rec, "100", " ", " ", "b"), 'editor') def 
test_get_subfield_values(self): """bibrecord - getting subfield values""" fi1, fi2 = bibrecord.record_get_field_instances(self.rec, "100", " ", " ") self.assertEqual(bibrecord.field_get_subfield_values(fi1, "b"), []) self.assertEqual(bibrecord.field_get_subfield_values(fi2, "b"), ["editor"]) class BibRecordGettingFieldValuesViaWildcardsTest(unittest.TestCase): """ bibrecord - testing for getting field/subfield values via wildcards """ def setUp(self): # pylint: disable-msg=C0103 """Initialize stuff""" xml_example_record = """ 1 val1 val2 val3 val4a val4b val5 val6 val7a val7b """ (self.rec, st, e) = bibrecord.create_record(xml_example_record, 1, 1) def test_get_field_instances_via_wildcard(self): """bibrecord - getting field instances via wildcards""" self.assertEqual(bibrecord.record_get_field_instances(self.rec, "100", " ", " "), []) self.assertEqual(bibrecord.record_get_field_instances(self.rec, "100", "%", " "), []) self.assertEqual(bibrecord.record_get_field_instances(self.rec, "100", "%", "%"), [([('a', 'val1')], 'C', '5', "", 2)]) self.assertEqual(bibrecord.record_get_field_instances(self.rec, "55%", "A", "%"), [([('a', 'val2')], 'A', 'B', "", 3), ([('a', 'val3')], 'A', " ", "", 4), ([('a', 'val6')], 'A', 'C', "", 7), ([('a', 'val7a'), ('b', 'val7b')], 'A', " ", "", 8)]) self.assertEqual(bibrecord.record_get_field_instances(self.rec, "55%", "A", " "), [([('a', 'val3')], 'A', " ", "", 4), ([('a', 'val7a'), ('b', 'val7b')], 'A', " ", "", 8)]) self.assertEqual(bibrecord.record_get_field_instances(self.rec, "556", "A", " "), [([('a', 'val7a'), ('b', 'val7b')], 'A', " ", "", 8)]) def test_get_field_values_via_wildcard(self): """bibrecord - getting field values via wildcards""" self.assertEqual(bibrecord.record_get_field_values(self.rec, "100", " ", " ", " "), []) self.assertEqual(bibrecord.record_get_field_values(self.rec, "100", "%", " ", " "), []) self.assertEqual(bibrecord.record_get_field_values(self.rec, "100", " ", "%", " "), []) self.assertEqual(bibrecord.record_get_field_values(self.rec, "100", "%", "%", " "), []) self.assertEqual(bibrecord.record_get_field_values(self.rec, "100", "%", "%", "z"), []) self.assertEqual(bibrecord.record_get_field_values(self.rec, "100", " ", " ", "%"), []) self.assertEqual(bibrecord.record_get_field_values(self.rec, "100", " ", " ", "a"), []) self.assertEqual(bibrecord.record_get_field_values(self.rec, "100", "%", " ", "a"), []) self.assertEqual(bibrecord.record_get_field_values(self.rec, "100", "%", "%", "a"), ['val1']) self.assertEqual(bibrecord.record_get_field_values(self.rec, "100", "%", "%", "%"), ['val1']) self.assertEqual(bibrecord.record_get_field_values(self.rec, "55%", "A", "%", "a"), ['val2', 'val3', 'val6', 'val7a']) self.assertEqual(bibrecord.record_get_field_values(self.rec, "55%", "A", " ", "a"), ['val3', 'val7a']) self.assertEqual(bibrecord.record_get_field_values(self.rec, "556", "A", " ", "a"), ['val7a']) self.assertEqual(bibrecord.record_get_field_values(self.rec, "555", " ", " ", " "), []) self.assertEqual(bibrecord.record_get_field_values(self.rec, "555", " ", " ", "z"), []) self.assertEqual(bibrecord.record_get_field_values(self.rec, "555", " ", " ", "%"), ['val4a', 'val4b']) self.assertEqual(bibrecord.record_get_field_values(self.rec, "55%", " ", " ", "b"), ['val4b']) self.assertEqual(bibrecord.record_get_field_values(self.rec, "55%", "%", "%", "b"), ['val4b', 'val7b']) self.assertEqual(bibrecord.record_get_field_values(self.rec, "55%", "A", " ", "b"), ['val7b']) self.assertEqual(bibrecord.record_get_field_values(self.rec, 
"55%", "A", "%", "b"), ['val7b']) self.assertEqual(bibrecord.record_get_field_values(self.rec, "55%", "A", " ", "a"), ['val3', 'val7a']) self.assertEqual(bibrecord.record_get_field_values(self.rec, "55%", "A", "%", "a"), ['val2', 'val3', 'val6', 'val7a']) self.assertEqual(bibrecord.record_get_field_values(self.rec, "55%", "%", "%", "a"), ['val2', 'val3', 'val4a', 'val5', 'val6', 'val7a']) self.assertEqual(bibrecord.record_get_field_values(self.rec, "55%", " ", " ", "a"), ['val4a']) def test_get_field_value_via_wildcard(self): """bibrecord - getting first field value via wildcards""" self.assertEqual(bibrecord.record_get_field_value(self.rec, "100", " ", " ", " "), '') self.assertEqual(bibrecord.record_get_field_value(self.rec, "100", "%", " ", " "), '') self.assertEqual(bibrecord.record_get_field_value(self.rec, "100", " ", "%", " "), '') self.assertEqual(bibrecord.record_get_field_value(self.rec, "100", "%", "%", " "), '') self.assertEqual(bibrecord.record_get_field_value(self.rec, "100", " ", " ", "%"), '') self.assertEqual(bibrecord.record_get_field_value(self.rec, "100", " ", " ", "a"), '') self.assertEqual(bibrecord.record_get_field_value(self.rec, "100", "%", " ", "a"), '') self.assertEqual(bibrecord.record_get_field_value(self.rec, "100", "%", "%", "a"), 'val1') self.assertEqual(bibrecord.record_get_field_value(self.rec, "100", "%", "%", "%"), 'val1') self.assertEqual(bibrecord.record_get_field_value(self.rec, "55%", "A", "%", "a"), 'val2') self.assertEqual(bibrecord.record_get_field_value(self.rec, "55%", "A", " ", "a"), 'val3') self.assertEqual(bibrecord.record_get_field_value(self.rec, "556", "A", " ", "a"), 'val7a') self.assertEqual(bibrecord.record_get_field_value(self.rec, "555", " ", " ", " "), '') self.assertEqual(bibrecord.record_get_field_value(self.rec, "555", " ", " ", "%"), 'val4a') self.assertEqual(bibrecord.record_get_field_value(self.rec, "55%", " ", " ", "b"), 'val4b') self.assertEqual(bibrecord.record_get_field_value(self.rec, "55%", "%", "%", "b"), 'val4b') self.assertEqual(bibrecord.record_get_field_value(self.rec, "55%", "A", " ", "b"), 'val7b') self.assertEqual(bibrecord.record_get_field_value(self.rec, "55%", "A", "%", "b"), 'val7b') self.assertEqual(bibrecord.record_get_field_value(self.rec, "55%", "A", " ", "a"), 'val3') self.assertEqual(bibrecord.record_get_field_value(self.rec, "55%", "A", "%", "a"), 'val2') self.assertEqual(bibrecord.record_get_field_value(self.rec, "55%", "%", "%", "a"), 'val2') self.assertEqual(bibrecord.record_get_field_value(self.rec, "55%", " ", " ", "a"), 'val4a') class BibRecordAddFieldTest(unittest.TestCase): """ bibrecord - testing adding field """ def setUp(self): # pylint: disable-msg=C0103 """Initialize stuff""" xml_example_record = """ 33 eng Doe1, John Doe2, John editor On the foo and bar1 On the foo and bar2 """ (self.rec, st, e) = bibrecord.create_record(xml_example_record, 1, 1) def test_add_controlfield(self): """bibrecord - adding controlfield""" field_number_1 = bibrecord.record_add_field(self.rec, "003", " ", " ", "SzGeCERN") field_number_2 = bibrecord.record_add_field(self.rec, "004", " ", " ", "Test") self.assertEqual(field_number_1, 7) self.assertEqual(field_number_2, 8) self.assertEqual(bibrecord.record_get_field_values(self.rec, "003", " ", " ", ""), ['SzGeCERN']) self.assertEqual(bibrecord.record_get_field_values(self.rec, "004", " ", " ", ""), ['Test']) def test_add_datafield(self): """bibrecord - adding datafield""" field_number_1 = bibrecord.record_add_field(self.rec, "100", " ", " ", "", [('a', 'Doe3, 
John')]) field_number_2 = bibrecord.record_add_field(self.rec, "100", " ", " ", "", [('a', 'Doe4, John'), ('b', 'editor')]) self.assertEqual(field_number_1, 7) self.assertEqual(field_number_2, 8) self.assertEqual(bibrecord.record_get_field_values(self.rec, "100", " ", " ", "a"), ['Doe1, John', 'Doe2, John', 'Doe3, John', 'Doe4, John']) self.assertEqual(bibrecord.record_get_field_values(self.rec, "100", " ", " ", "b"), ['editor', 'editor']) def test_add_controlfield_on_desired_position(self): """bibrecord - adding controlfield on desired position""" field_number_1 = bibrecord.record_add_field(self.rec, "005", " ", " ", "Foo", [], 0) field_number_2 = bibrecord.record_add_field(self.rec, "006", " ", " ", "Bar", [], 0) self.assertEqual(field_number_1, 0) self.assertEqual(field_number_2, 7) def test_add_datafield_on_desired_position(self): """bibrecord - adding datafield on desired position""" field_number_1 = bibrecord.record_add_field(self.rec, "100", " ", " ", " ", [('a', 'Doe3, John')], 0) field_number_2 = bibrecord.record_add_field(self.rec, "100", " ", " ", " ", [('a', 'Doe4, John'), ('b', 'editor')], 0) self.assertEqual(field_number_1, 0) self.assertEqual(field_number_2, 7) class BibRecordDeleteFieldTest(unittest.TestCase): """ bibrecord - testing field deletion """ def setUp(self): # pylint: disable-msg=C0103 """Initialize stuff""" xml_example_record = """ 33 eng Doe1, John Doe2, John editor On the foo and bar1 On the foo and bar2 """ (self.rec, st, e) = bibrecord.create_record(xml_example_record, 1, 1) xml_example_record_empty = """ """ (self.rec_empty, st, e) = bibrecord.create_record(xml_example_record_empty, 1, 1) def test_delete_controlfield(self): """bibrecord - deleting controlfield""" bibrecord.record_delete_field(self.rec, "001", " ", " ") self.assertEqual(bibrecord.record_get_field_values(self.rec, "001", " ", " ", " "), []) self.assertEqual(bibrecord.record_get_field_values(self.rec, "100", " ", " ", "b"), ['editor']) self.assertEqual(bibrecord.record_get_field_values(self.rec, "245", " ", "2", "a"), ['On the foo and bar2']) def test_delete_datafield(self): """bibrecord - deleting datafield""" bibrecord.record_delete_field(self.rec, "100", " ", " ") self.assertEqual(bibrecord.record_get_field_values(self.rec, "001", " ", " ", ""), ['33']) self.assertEqual(bibrecord.record_get_field_values(self.rec, "100", " ", " ", "b"), []) bibrecord.record_delete_field(self.rec, "245", " ", " ") self.assertEqual(bibrecord.record_get_field_values(self.rec, "245", " ", "1", "a"), ['On the foo and bar1']) self.assertEqual(bibrecord.record_get_field_values(self.rec, "245", " ", "2", "a"), ['On the foo and bar2']) bibrecord.record_delete_field(self.rec, "245", " ", "2") self.assertEqual(bibrecord.record_get_field_values(self.rec, "245", " ", "1", "a"), ['On the foo and bar1']) self.assertEqual(bibrecord.record_get_field_values(self.rec, "245", " ", "2", "a"), []) def test_add_delete_add_field_to_empty_record(self): """bibrecord - adding, deleting, and adding back a field to an empty record""" field_number_1 = bibrecord.record_add_field(self.rec_empty, "003", " ", " ", "SzGeCERN") self.assertEqual(field_number_1, 1) self.assertEqual(bibrecord.record_get_field_values(self.rec_empty, "003", " ", " ", ""), ['SzGeCERN']) bibrecord.record_delete_field(self.rec_empty, "003", " ", " ") self.assertEqual(bibrecord.record_get_field_values(self.rec_empty, "003", " ", " ", ""), []) field_number_1 = bibrecord.record_add_field(self.rec_empty, "003", " ", " ", "SzGeCERN2") self.assertEqual(field_number_1, 1) 
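# Note added for clarity (not part of the original tests): throughout this module
# a field instance is asserted in bibrecord's internal 5-tuple form
#   (subfields, ind1, ind2, controlfield_value, field_number)
# e.g. a '100  $a Doe1, John' datafield appears as
#   ([('a', 'Doe1, John')], ' ', ' ', '', 3)
# and the '001' controlfield holding '33' as
#   ([], ' ', ' ', '33', 1).
# record_add_field() returns the field number assigned to the new field, which is
# why the values 0, 1, 7 and 8 are checked in the add/delete tests above.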
self.assertEqual(bibrecord.record_get_field_values(self.rec_empty, "003", " ", " ", ""), ['SzGeCERN2']) class BibRecordSpecialTagParsingTest(unittest.TestCase): """ bibrecord - parsing special tags (FMT, FFT)""" def setUp(self): # pylint: disable-msg=C0103 """setting up example records""" self.xml_example_record_with_fmt = """ 33 eng HB Let us see if this gets inserted well. """ self.xml_example_record_with_fft = """ 33 eng file:///foo.pdf http://bar.com/baz.ps.gz """ self.xml_example_record_with_xyz = """ 33 eng HB Let us see if this gets inserted well. """ def test_parsing_file_containing_fmt_special_tag_with_correcting(self): """bibrecord - parsing special FMT tag, correcting on""" rec, st, e = bibrecord.create_record(self.xml_example_record_with_fmt, 1, 1) self.assertEqual(rec, {u'001': [([], " ", " ", '33', 1)], 'FMT': [([('f', 'HB'), ('g', 'Let us see if this gets inserted well.')], " ", " ", "", 3)], '041': [([('a', 'eng')], " ", " ", "", 2)]}) self.assertEqual(bibrecord.record_get_field_values(rec, "041", " ", " ", "a"), ['eng']) self.assertEqual(bibrecord.record_get_field_values(rec, "FMT", " ", " ", "f"), ['HB']) self.assertEqual(bibrecord.record_get_field_values(rec, "FMT", " ", " ", "g"), ['Let us see if this gets inserted well.']) def test_parsing_file_containing_fmt_special_tag_without_correcting(self): """bibrecord - parsing special FMT tag, correcting off""" rec, st, e = bibrecord.create_record(self.xml_example_record_with_fmt, 1, 0) self.assertEqual(rec, {u'001': [([], " ", " ", '33', 1)], 'FMT': [([('f', 'HB'), ('g', 'Let us see if this gets inserted well.')], " ", " ", "", 3)], '041': [([('a', 'eng')], " ", " ", "", 2)]}) self.assertEqual(bibrecord.record_get_field_values(rec, "041", " ", " ", "a"), ['eng']) self.assertEqual(bibrecord.record_get_field_values(rec, "FMT", " ", " ", "f"), ['HB']) self.assertEqual(bibrecord.record_get_field_values(rec, "FMT", " ", " ", "g"), ['Let us see if this gets inserted well.']) def test_parsing_file_containing_fft_special_tag_with_correcting(self): """bibrecord - parsing special FFT tag, correcting on""" rec, st, e = bibrecord.create_record(self.xml_example_record_with_fft, 1, 1) self.assertEqual(rec, {u'001': [([], " ", " ", '33', 1)], 'FFT': [([('a', 'file:///foo.pdf'), ('a', 'http://bar.com/baz.ps.gz')], " ", " ", "", 3)], '041': [([('a', 'eng')], " ", " ", "", 2)]}) self.assertEqual(bibrecord.record_get_field_values(rec, "041", " ", " ", "a"), ['eng']) self.assertEqual(bibrecord.record_get_field_values(rec, "FFT", " ", " ", "a"), ['file:///foo.pdf', 'http://bar.com/baz.ps.gz']) def test_parsing_file_containing_fft_special_tag_without_correcting(self): """bibrecord - parsing special FFT tag, correcting off""" rec, st, e = bibrecord.create_record(self.xml_example_record_with_fft, 1, 0) self.assertEqual(rec, {u'001': [([], " ", " ", '33', 1)], 'FFT': [([('a', 'file:///foo.pdf'), ('a', 'http://bar.com/baz.ps.gz')], " ", " ", "", 3)], '041': [([('a', 'eng')], " ", " ", "", 2)]}) self.assertEqual(bibrecord.record_get_field_values(rec, "041", " ", " ", "a"), ['eng']) self.assertEqual(bibrecord.record_get_field_values(rec, "FFT", " ", " ", "a"), ['file:///foo.pdf', 'http://bar.com/baz.ps.gz']) def test_parsing_file_containing_xyz_special_tag_with_correcting(self): """bibrecord - parsing unrecognized special XYZ tag, correcting on""" # XYZ should not get accepted when correcting is on; should get changed to 000 rec, st, e = bibrecord.create_record(self.xml_example_record_with_xyz, 1, 1) self.assertEqual(rec, {u'001': [([], " ", " ", '33', 
1)], '000': [([('f', 'HB'), ('g', 'Let us see if this gets inserted well.')], " ", " ", "", 3)], '041': [([('a', 'eng')], " ", " ", "", 2)]}) self.assertEqual(bibrecord.record_get_field_values(rec, "041", " ", " ", "a"), ['eng']) self.assertEqual(bibrecord.record_get_field_values(rec, "XYZ", " ", " ", "f"), []) self.assertEqual(bibrecord.record_get_field_values(rec, "XYZ", " ", " ", "g"), []) self.assertEqual(bibrecord.record_get_field_values(rec, "000", " ", " ", "f"), ['HB']) self.assertEqual(bibrecord.record_get_field_values(rec, "000", " ", " ", "g"), ['Let us see if this gets inserted well.']) def test_parsing_file_containing_xyz_special_tag_without_correcting(self): """bibrecord - parsing unrecognized special XYZ tag, correcting off""" # XYZ should get accepted without correcting rec, st, e = bibrecord.create_record(self.xml_example_record_with_xyz, 1, 0) self.assertEqual(rec, {u'001': [([], " ", " ", '33', 1)], 'XYZ': [([('f', 'HB'), ('g', 'Let us see if this gets inserted well.')], " ", " ", "", 3)], '041': [([('a', 'eng')], " ", " ", "", 2)]}) self.assertEqual(bibrecord.record_get_field_values(rec, "041", " ", " ", "a"), ['eng']) self.assertEqual(bibrecord.record_get_field_values(rec, "XYZ", " ", " ", "f"), ['HB']) self.assertEqual(bibrecord.record_get_field_values(rec, "XYZ", " ", " ", "g"), ['Let us see if this gets inserted well.']) class BibRecordPrintingTest(unittest.TestCase): """ bibrecord - testing for printing record """ def setUp(self): # pylint: disable-msg=C0103 """Initialize stuff""" self.xml_example_record = """ 81 TEST-ARTICLE-2006-001 ARTICLE-2006-001 Test ti """ self.xml_example_record_short = """ 81 TEST-ARTICLE-2006-001 ARTICLE-2006-001 """ self.xml_example_multi_records = """ 81 TEST-ARTICLE-2006-001 ARTICLE-2006-001 Test ti 82 Author, t """ self.xml_example_multi_records_short = """ 81 TEST-ARTICLE-2006-001 ARTICLE-2006-001 82 """ def test_print_rec(self): """bibrecord - print rec""" rec, st, e = bibrecord.create_record(self.xml_example_record, 1, 1) rec_short, st_short, e_short = bibrecord.create_record(self.xml_example_record_short, 1, 1) self.assertEqual(bibrecord.create_record(bibrecord.print_rec(rec, tags=[]), 1, 1)[0], rec) self.assertEqual(bibrecord.create_record(bibrecord.print_rec(rec, tags=["001", "037"]), 1, 1)[0], rec_short) self.assertEqual(bibrecord.create_record(bibrecord.print_rec(rec, tags=["037"]), 1, 1)[0], rec_short) def test_print_recs(self): """bibrecord - print multiple recs""" list_of_recs = bibrecord.create_records(self.xml_example_multi_records, 1, 1) list_of_recs_elems = [elem[0] for elem in list_of_recs] list_of_recs_short = bibrecord.create_records(self.xml_example_multi_records_short, 1, 1) list_of_recs_short_elems = [elem[0] for elem in list_of_recs_short] self.assertEqual(bibrecord.create_records(bibrecord.print_recs(list_of_recs_elems, tags=[]), 1, 1), list_of_recs) self.assertEqual(bibrecord.create_records(bibrecord.print_recs(list_of_recs_elems, tags=["001", "037"]), 1, 1), list_of_recs_short) self.assertEqual(bibrecord.create_records(bibrecord.print_recs(list_of_recs_elems, tags=["037"]), 1, 1), list_of_recs_short) def create_test_suite(): """Return test suite for the bibrecord module""" return unittest.TestSuite((unittest.makeSuite(BibRecordSanityTest, 'test'), unittest.makeSuite(BibRecordSuccessTest, 'test'), unittest.makeSuite(BibRecordBadInputTreatmentTest, 'test'), unittest.makeSuite(BibRecordGettingFieldValuesTest, 'test'), unittest.makeSuite(BibRecordGettingFieldValuesViaWildcardsTest, 'test'), 
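# (Added usage note:) any one of these suites can also be run on its own with the
# runner used at the bottom of this module, e.g.
#   unittest.TextTestRunner(verbosity=2).run(
#       unittest.makeSuite(BibRecordSanityTest, 'test'))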
unittest.makeSuite(BibRecordAddFieldTest, 'test'), unittest.makeSuite(BibRecordDeleteFieldTest, 'test'), unittest.makeSuite(BibRecordAccentedUnicodeLettersTest, 'test'), unittest.makeSuite(BibRecordSpecialTagParsingTest, 'test'), unittest.makeSuite(BibRecordPrintingTest, 'test'), )) if __name__ == '__main__': unittest.TextTestRunner(verbosity=2).run(create_test_suite()) diff --git a/modules/bibedit/lib/refextract_config.py b/modules/bibedit/lib/refextract_config.py index ebc1b1025..7f3b858dc 100644 --- a/modules/bibedit/lib/refextract_config.py +++ b/modules/bibedit/lib/refextract_config.py @@ -1,75 +1,75 @@ # -*- coding: utf-8 -*- ## ## $Id$ ## ## This file is part of CDS Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 CERN. ## ## CDS Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## CDS Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with CDS Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. """RefExtract configuration.""" __revision__ = "$Id$" -from invenio.config import version, etcdir, cdsname +from invenio.config import CFG_VERSION, CFG_ETCDIR, cdsname # pylint: disable-msg=C0301 # version number: -CFG_REFEXTRACT_VERSION = "CDS Invenio/%s refextract/%s" % (version, version) +CFG_REFEXTRACT_VERSION = "CDS Invenio/%s refextract/%s" % (CFG_VERSION, CFG_VERSION) # periodicals knowledge base: -CFG_REFEXTRACT_KB_JOURNAL_TITLES = "%s/bibedit/refextract-journal-titles.kb" % etcdir +CFG_REFEXTRACT_KB_JOURNAL_TITLES = "%s/bibedit/refextract-journal-titles.kb" % CFG_ETCDIR # report numbers knowledge base: -CFG_REFEXTRACT_KB_REPORT_NUMBERS = "%s/bibedit/refextract-report-numbers.kb" % etcdir +CFG_REFEXTRACT_KB_REPORT_NUMBERS = "%s/bibedit/refextract-report-numbers.kb" % CFG_ETCDIR ## MARC Fields and subfields used by refextract: ## reference fields: CFG_REFEXTRACT_CTRL_FIELD_RECID = "001" ## control-field recid CFG_REFEXTRACT_TAG_ID_REFERENCE = "999" ## ref field tag CFG_REFEXTRACT_IND1_REFERENCE = "C" ## ref field ind1 CFG_REFEXTRACT_IND2_REFERENCE = "5" ## ref field ind2 CFG_REFEXTRACT_SUBFIELD_MARKER = "o" ## ref marker subfield CFG_REFEXTRACT_SUBFIELD_MISC = "m" ## ref misc subfield CFG_REFEXTRACT_SUBFIELD_REPORT_NUM = "r" ## ref reportnum subfield CFG_REFEXTRACT_SUBFIELD_TITLE = "s" ## ref title subfield CFG_REFEXTRACT_SUBFIELD_URL = "u" ## ref url subfield CFG_REFEXTRACT_SUBFIELD_URL_DESCR = "z" ## ref url-text subfield ## refextract statisticts fields: CFG_REFEXTRACT_TAG_ID_EXTRACTION_STATS = "999" ## ref-stats tag CFG_REFEXTRACT_IND1_EXTRACTION_STATS = "C" ## ref-stats ind1 CFG_REFEXTRACT_IND2_EXTRACTION_STATS = "6" ## ref-stats ind2 CFG_REFEXTRACT_SUBFIELD_EXTRACTION_STATS = "a" ## ref-stats subfield ## Internal tags are used by refextract to mark-up recognised citation ## information. 
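## (Illustrative aside, not in the original file; the reference values below are
## invented:) with the settings above a recognised reference is written into a
## 999C5 datafield of the record, for instance
##   999C5  $o [1]  $m A. Author et al.,  $s Phys. Lett. B 25 (1967) 29  $r hep-ex/0000001
## and the per-record extraction statistics go into a 999C6 $a subfield.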
These are the "closing tags: CFG_REFEXTRACT_MARKER_CLOSING_REPORT_NUM = r"" CFG_REFEXTRACT_MARKER_CLOSING_TITLE = r"" CFG_REFEXTRACT_MARKER_CLOSING_SERIES = r"" CFG_REFEXTRACT_MARKER_CLOSING_VOLUME = r"" CFG_REFEXTRACT_MARKER_CLOSING_YEAR = r"" CFG_REFEXTRACT_MARKER_CLOSING_PAGE = r"" ## XML Record and collection opening/closing tags: CFG_REFEXTRACT_XML_VERSION = u"""""" CFG_REFEXTRACT_XML_COLLECTION_OPEN = u"""""" CFG_REFEXTRACT_XML_COLLECTION_CLOSE = u"""\n""" CFG_REFEXTRACT_XML_RECORD_OPEN = u"" CFG_REFEXTRACT_XML_RECORD_CLOSE = u"" diff --git a/modules/bibformat/lib/bibformat_bfx_engine.py b/modules/bibformat/lib/bibformat_bfx_engine.py index aa17be5d3..de176a104 100644 --- a/modules/bibformat/lib/bibformat_bfx_engine.py +++ b/modules/bibformat/lib/bibformat_bfx_engine.py @@ -1,1260 +1,1260 @@ ## $Id$ ## ## This file is part of CDS Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 CERN. ## ## CDS Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## CDS Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -## General Public License for more details. +## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with CDS Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. """ -BFX formatting engine. +BFX formatting engine. For API: see format_with_bfx() docstring below. """ __revision__ = "$Id$" import re import copy as p_copy from xml.dom import minidom, Node from xml.sax import saxutils from invenio.bibformat_engine import BibFormatObject, get_format_element, eval_format_element -from invenio.bibformat_bfx_engine_config import CFG_BIBFORMAT_BFX_LABEL_DEFINITIONS, CFG_BIBFORMAT_BFX_TEMPLATES_PATH +from invenio.bibformat_bfx_engine_config import CFG_BIBFORMAT_BFX_LABEL_DEFINITIONS, CFG_BIBFORMAT_BFX_TEMPLATES_PATH from invenio.bibformat_bfx_engine_config import CFG_BIBFORMAT_BFX_FORMAT_TEMPLATE_EXTENSION, CFG_BIBFORMAT_BFX_ELEMENT_NAMESPACE from invenio.bibformat_bfx_engine_config import CFG_BIBFORMAT_BFX_ERROR_MESSAGES, CFG_BIBFORMAT_BFX_WARNING_MESSAGES address_pattern = r'(?P[a-z_]*):?/?(?P[0-9_?\w]*)/?(?P[\w_?]?)#?(?P.*)' def format_with_bfx(recIDs, out_file, template_name, preprocess=None): ''' Format a set of records according to a BFX template. This is the main entry point to the BFX engine. - + @param recIDs a list of record IDs to format @param out_file an object to write in; this can be every object which has a 'write' method: file, req, StringIO @param template_name the file name of the BFX template without the path and the .bfx extension @param preprocess an optional function; every record is passed through this function for initial preprocessing before formatting ''' trans = MARCTranslator(CFG_BIBFORMAT_BFX_LABEL_DEFINITIONS) trans.set_record_ids(recIDs, preprocess) parser = BFXParser(trans) template_tree = parser.load_template(template_name) parser.walk(template_tree, out_file) return None class BFXParser: ''' A general-purpose parser for generating xml/xhtml/text output based on a template system. Must be initialised with a translator. 
A translator is like a blackbox that returns values, calls functions, etc... Works with every translator supporting the following simple interface: - is_defined(name) - get_value(name) - iterator(name) - call_function(func_name, list_of_parameters) Customized for MARC to XML conversion through the use of a MARCTranslator. - Templates are strict XML files. They are built by combining any tags with the + Templates are strict XML files. They are built by combining any tags with the special BFX tags living in the http://cdsware.cern.ch/invenio/ namespace. Easily extensible by tags of your own. Defined tags: - template: defines a template - template_ref: a reference to a template - loop structure - if, then, elif, else structure - text: output text - field: query translator for field 'name' - element: call external functions ''' def __init__(self, translator): ''' Create an instance of the BFXParser class. Initialize with a translator. The BFXparser makes queries to the translator for the values of certain names. For the communication it uses the following translator methods: - is_defined(name) - iterator(name) - get_value(name, [display_specifier]) @param translator the translator used by the class instance ''' self.translator = translator self.known_operators = ['style', 'format', 'template', 'template_ref', 'text', 'field', 'element', 'loop', 'if', 'then', 'else', 'elif'] self.flags = {} # store flags here; self.templates = {} # store templates and formats here - self.start_template_name = None #the name of the template from which the 'execution' starts; + self.start_template_name = None #the name of the template from which the 'execution' starts; #this is usually a format or the only template found in a doc def load_template(self, template_name, template_source=None): ''' Load a BFX template file. A template file can have one of two forms: - it is a file with a single template. Root tag is 'template'. In an API call the single template element is 'executed'. - it is a 'style' file which contains exactly one format and zero or more templates. Root tag is 'style' with children 'format' and 'template'(s). In this case only the format code is 'executed'. Naturally, in it, it would have references to other templates in the document. Template can be given by name (in that case search path is in standard directory for bfx template) or directly using the template source. If given, template_source overrides template_name - + @param template_name the name of the BFX template, the same as the name of the filename without the extension @return a DOM tree of the template ''' if template_source is None: template_file_name = CFG_BIBFORMAT_BFX_TEMPLATES_PATH + '/' + template_name + '.' 
+ CFG_BIBFORMAT_BFX_FORMAT_TEMPLATE_EXTENSION #load document doc = minidom.parse(template_file_name) else: doc = minidom.parseString(template_source) #set exec flag to false and walk document to find templates and formats self.flags['exec'] = False self.walk(doc) #check found templates if self.start_template_name: start_template = self.templates[self.start_template_name]['node'] else: #print CFG_BIBFORMAT_BFX_WARNING_MESSAGES['WRN_BFX_NO_FORMAT_FOUND'] if len(self.templates) == 1: # no format found, check if there is a default template self.start_template_name = self.templates.keys()[0] start_template = self.templates[self.start_template_name]['node'] else: #no formats found, templates either zero or more than one if len(self.templates) > 1: print CFG_BIBFORMAT_BFX_ERROR_MESSAGES['ERR_BFX_TOO_MANY_TEMPLATES'] #else: # print CFG_BIBFORMAT_BFX_ERROR_MESSAGES['ERR_BFX_NO_TEMPLATES_FOUND'] return None self.flags['exec'] = True return start_template def parse_attribute(self, expression): ''' A function to check if an expression is of the special form [!name:display]. A short form for saying , used in element attributes. @param expression a string, usually taken from an attribute value @return if the string is special, parse it and return the corresponding value; else return the initial expression ''' output = expression pattern = '\[!(?P[\w_.:]*)\]' expr = re.compile(pattern) match = expr.match(expression) if match: tmp = match.group('tmp') tmp = tmp.split(':') var = tmp[0] display = '' if len(tmp) == 2: display = tmp[1] output = self.translator.get_value(var, display) output = xml_escape(output) return output - + def walk(self, parent, out_file=None): ''' Walk a template DOM tree. The main function in the parser. It is recursively called until all the nodes are processed. This function is used in two different ways: - for initial loading of the template (and validation) - for 'execution' of a format/template The different behaviour is achieved through the use of flags, which can be set to True or False. 
@param parent a node to process; in an API call this is the root node @param out_file an object to write to; must have a 'write' method - + @return None ''' for node in parent.childNodes: if node.nodeType == Node.TEXT_NODE: value = get_node_value(node) value = value.strip() if out_file: out_file.write(value) if node.nodeType == Node.ELEMENT_NODE: #get values name, attributes, element_namespace = get_node_name(node), get_node_attributes(node), get_node_namespace(node) # write values if element_namespace != CFG_BIBFORMAT_BFX_ELEMENT_NAMESPACE: #parse all the attributes for key in attributes.keys(): attributes[key] = self.parse_attribute(attributes[key]) if node_has_subelements(node): if out_file: out_file.write(create_xml_element(name=name, attrs=attributes, element_type=xmlopen)) self.walk(node, out_file) #walk subnodes if out_file: out_file.write(create_xml_element(name=name, element_type=xmlclose)) else: if out_file: out_file.write(create_xml_element(name=name, attrs=attributes, element_type=xmlempty)) #name is a special name, must fall in one of the next cases: elif node.localName == 'style': self.ctl_style(node, out_file) elif node.localName == 'format': self.ctl_format(node, out_file) elif node.localName == 'template': self.ctl_template(node, out_file) elif node.localName == 'template_ref': self.ctl_template_ref(node, out_file) elif node.localName == 'element': self.ctl_element(node, out_file) elif node.localName == 'field': self.ctl_field(node, out_file) elif node.localName == 'text': self.ctl_text(node, out_file) elif node.localName == 'loop': self.ctl_loop(node, out_file) elif node.localName == 'if': self.ctl_if(node, out_file) elif node.localName == 'then': self.ctl_then(node, out_file) elif node.localName == 'else': self.ctl_else(node, out_file) elif node.localName == 'elif': self.ctl_elif(node, out_file) else: if node.localName in self.known_operators: print 'Note for programmer: you haven\'t implemented operator %s.' % (name) else: print CFG_BIBFORMAT_BFX_ERROR_MESSAGES['ERR_BFX_INVALID_OPERATOR_NAME'] % (name) return None def ctl_style(self, node, out_file): ''' Process a style root node. ''' #exec mode if self.flags['exec']: return None #test mode self.walk(node, out_file) return None def ctl_format(self, node, out_file): ''' Process a format node. Get name, description and content attributes. This function is called only in test mode. ''' #exec mode if self.flags['exec']: return None #test mode attrs = get_node_attributes(node) #get template name and give control to ctl_template if attrs.has_key('name'): name = attrs['name'] if self.templates.has_key(name): print CFG_BIBFORMAT_BFX_ERROR_MESSAGES['ERR_BFX_DUPLICATE_NAME'] % (name) return None self.start_template_name = name self.ctl_template(node, out_file) else: print CFG_BIBFORMAT_BFX_ERROR_MESSAGES['ERR_BFX_TEMPLATE_NO_NAME'] return None return None - + def ctl_template(self, node, out_file): ''' Process a template node. Get name, description and content attributes. Register name and store for later calls from template_ref. This function is called only in test mode. 
''' #exec mode if self.flags['exec']: return None #test mode attrs = get_node_attributes(node) #get template name if attrs.has_key('name'): name = attrs['name'] if self.templates.has_key(name): print CFG_BIBFORMAT_BFX_ERROR_MESSAGES['ERR_BFX_DUPLICATE_NAME'] % (name) return None self.templates[name] = {} self.templates[name]['node'] = node else: print CFG_BIBFORMAT_BFX_ERROR_MESSAGES['ERR_BFX_TEMPLATE_NO_NAME'] return None #get template description if attrs.has_key('description'): description = attrs['description'] else: description = '' print CFG_BIBFORMAT_BFX_WARNING_MESSAGES['WRN_BFX_TEMPLATE_NO_DESCRIPTION'] self.templates[name]['description'] = description #get content-type of resulting output if attrs.has_key('content'): content_type = attrs['content'] else: content_type = 'text/xml' print CFG_BIBFORMAT_BFX_WARNING_MESSAGES['WRN_BFX_TEMPLATE_NO_CONTENT'] self.templates[name]['content_type'] = content_type #walk node self.walk(node, out_file) return None - + def ctl_template_ref(self, node, out_file): ''' Reference to an external template. This function is called only in execution mode. Bad references appear as run-time errors. ''' #test mode if not self.flags['exec']: return None #exec mode attrs = get_node_attributes(node) - if not attrs.has_key('name'): + if not attrs.has_key('name'): print CFG_BIBFORMAT_BFX_ERROR_MESSAGES['ERR_BFX_TEMPLATE_REF_NO_NAME'] return None name = attrs['name'] #first check for a template in the same file, that is in the already cached templates if self.templates.has_key(name): node_to_walk = self.templates[name]['node'] self.walk(node_to_walk, out_file) else: #load a file and execute it pass - #template_file_name = CFG_BIBFORMAT_BFX_TEMPLATES_PATH + name + '/' + CFG_BIBFORMAT_BFX_FORMAT_TEMPLATE_EXTENSION + #template_file_name = CFG_BIBFORMAT_BFX_TEMPLATES_PATH + name + '/' + CFG_BIBFORMAT_BFX_FORMAT_TEMPLATE_EXTENSION #try: # node = minidom.parse(template_file_name) #except: # print CFG_BIBFORMAT_BFX_ERROR_MESSAGES['ERR_BFX_TEMPLATE_NOT_FOUND'] % (template_file_name) return None - + def ctl_element(self, node, out_file): ''' Call an external element (written in Python). ''' #test mode if not self.flags['exec']: return None #exec mode parameters = get_node_attributes(node) if not parameters.has_key('name'): print CFG_BIBFORMAT_BFX_ERROR_MESSAGES['ERR_BFX_ELEMENT_NO_NAME'] return None function_name = parameters['name'] del parameters['name'] #now run external bfe_name.py, with param attrs if function_name: value = self.translator.call_function(function_name, parameters) value = xml_escape(value) out_file.write(value) return None - + def ctl_field(self, node, out_file): ''' Get the value of a field by its name. 
''' #test mode - if not self.flags['exec']: + if not self.flags['exec']: return None #exec mode attrs = get_node_attributes(node) if not attrs.has_key('name'): print CFG_BIBFORMAT_BFX_ERROR_MESSAGES['ERR_BFX_FIELD_NO_NAME'] return None display = '' if attrs.has_key('display'): display = attrs['display'] var = attrs['name'] if not self.translator.is_defined(var): print CFG_BIBFORMAT_BFX_ERROR_MESSAGES['ERR_BFX_NO_SUCH_FIELD'] % (var) return None value = self.translator.get_value(var, display) value = xml_escape(value) out_file.write(value) return None def ctl_text(self, node, out_file): ''' Output a text ''' #test mode - if not self.flags['exec']: + if not self.flags['exec']: return None #exec mode attrs = get_node_attributes(node) if not attrs.has_key('value'): print CFG_BIBFORMAT_BFX_ERROR_MESSAGES['ERR_BFX_TEXT_NO_VALUE'] return None value = attrs['value'] value = value.replace(r'\n', '\n') #value = xml_escape(value) if type(value) == type(u''): value = value.encode('utf-8') out_file.write(value) return None def ctl_loop(self, node, out_file): ''' Loop through a set of values. ''' #test mode - if not self.flags['exec']: + if not self.flags['exec']: self.walk(node, out_file) return None #exec mode attrs = get_node_attributes(node) if not attrs.has_key('object'): print CFG_BIBFORMAT_BFX_ERROR_MESSAGES['ERR_BFX_LOOP_NO_OBJECT'] return None name = attrs['object'] if not self.translator.is_defined(name): print CFG_BIBFORMAT_BFX_ERROR_MESSAGES['ERR_BFX_NO_SUCH_FIELD'] % (name) return None for new_object in self.translator.iterator(name): self.walk(node, out_file) return None def ctl_if(self, node, out_file): ''' An if/then/elif/.../elif/else construct. 'If' can have several forms: : True if var is non-empty, eval as string : True if var=value, eval as string : True if var : True if var>value, try to eval as num, else eval as string : True if var<=value, try to eval as num, else eval as string : True if var>=value, try to eval as num, else eval as string : True if var in [val1, val2], eval as string : True if var not in [val1, val2], eval as string : True if var!=value, eval as string : Match against a regular expression - + Example: Pauli Pauli other ''' #test mode if not self.flags['exec']: self.walk(node, out_file) return None #exec mode attrs = get_node_attributes(node) if not attrs.has_key('name'): print CFG_BIBFORMAT_BFX_ERROR_MESSAGES['ERR_BFX_IF_NO_NAME'] return None - #determine result + #determine result var = attrs['name'] if not self.translator.is_defined(var): print CFG_BIBFORMAT_BFX_ERROR_MESSAGES['ERR_BFX_NO_SUCH_FIELD'] % (var) - return None + return None value = self.translator.get_value(var) value = value.strip() #equal if attrs.has_key('eq'): pattern = attrs['eq'] if is_number(pattern) and is_number(value): result = (float(value)==float(pattern)) else: result = (value==pattern) #not equal elif attrs.has_key('neq'): pattern = attrs['neq'] if is_number(pattern) and is_number(value): result = (float(value)!=float(pattern)) else: - result = (value!=pattern) + result = (value!=pattern) #lower than elif attrs.has_key('lt'): pattern = attrs['lt'] if is_number(pattern) and is_number(value): result = (float(value)float(pattern)) else: result = (value>pattern) #lower or equal than elif attrs.has_key('le'): pattern = attrs['le'] if is_number(pattern) and is_number(value): result = (float(value)<=float(pattern)) else: result = (value<=pattern) #greater or equal than elif attrs.has_key('ge'): pattern = attrs['ge'] if is_number(pattern) and is_number(value): result = 
(float(value)>=float(pattern)) else: result = (value>=pattern) #in elif attrs.has_key('in'): pattern = attrs['in'] values = pattern.split() result = (value in values) #not in elif attrs.has_key('nin'): pattern = attrs['nin'] values = pattern.split() result = (value not in values) #match against a regular expression elif attrs.has_key('like'): pattern = attrs['like'] try: expr = re.compile(pattern) result = expr.match(value) except: print CFG_BIBFORMAT_BFX_ERROR_MESSAGES['ERR_BFX_INVALID_RE'] % (pattern) #simple form: True if non-empty, otherwise False else: result = value #end of evaluation #================= #validate subnodes then_node = get_node_subelement(node, 'then', CFG_BIBFORMAT_BFX_ELEMENT_NAMESPACE) else_node = get_node_subelement(node, 'else', CFG_BIBFORMAT_BFX_ELEMENT_NAMESPACE) elif_node = get_node_subelement(node, 'elif', CFG_BIBFORMAT_BFX_ELEMENT_NAMESPACE) #having else and elif siblings at the same time is a syntax error if (else_node is not None) and (elif_node is not None): print CFG_BIBFORMAT_BFX_ERROR_MESSAGES['ERR_BFX_IF_WRONG_SYNTAX'] return None - #now walk appropriate nodes, according to the result + #now walk appropriate nodes, according to the result if result: #True if then_node: self.walk(then_node, out_file) #todo: add short form, without 'then', just elements within if statement to walk on 'true' and no 'elif' or 'else' elements else: #False if elif_node: self.ctl_if(elif_node, out_file) elif else_node: self.walk(else_node, out_file) return None def ctl_then(self, node, out_file): ''' Calling 'then' directly from the walk function means a syntax error. ''' #test mode if not self.flags['exec']: self.walk(node, out_file) return None #exec mode print CFG_BIBFORMAT_BFX_ERROR_MESSAGES['ERR_BFX_IF_WRONG_SYNTAX'] return None - + def ctl_else(self, node, out_file): ''' Calling 'else' directly from the walk function means a syntax error. ''' #test mode if not self.flags['exec']: self.walk(node, out_file) return None #exec mode print CFG_BIBFORMAT_BFX_ERROR_MESSAGES['ERR_BFX_IF_WRONG_SYNTAX'] return None - + def ctl_elif(self, node, out_file): ''' Calling 'elif' directly from the walk function means a syntax error. ''' #test mode if not self.flags['exec']: self.walk(node, out_file) return None - #exec mode + #exec mode print CFG_BIBFORMAT_BFX_ERROR_MESSAGES['ERR_BFX_IF_WRONG_SYNTAX'] return None - - + + class MARCTranslator: ''' memory[name] [name]['addresses'] - the set of rules for each of the defined names [name]['parent'] - the name of the parent; '' if none; [name]['children'] - a list with the name of the children of every variable [name]['object'] - stored state of object for performance efficiency ''' def __init__(self, labels=None): ''' Create an instance of the translator and init with the list of the defined labels and their rules. 
''' if labels is None: labels = {} self.recIDs = [] self.recID = 0 self.recID_index = 0 self.record = None self.memory = {} pattern = address_pattern - expr = re.compile(pattern) + expr = re.compile(pattern) for name in labels.keys(): self.memory[name] = {} self.memory[name]['object'] = None self.memory[name]['parent'] = '' self.memory[name]['children'] = [] self.memory[name]['addresses'] = p_copy.deepcopy(labels[name]) for name in self.memory: for i in range(len(self.memory[name]['addresses'])): address = self.memory[name]['addresses'][i] match = expr.match(address) if not match: print 'Invalid address: ', name, address else: parent_name = match.group('parent') if parent_name: if not self.memory.has_key(parent_name): print CFG_BIBFORMAT_BFX_ERROR_MESSAGES['ERR_BFX_NO_SUCH_FIELD'] % (parent_name) else: self.memory[name]['parent'] = parent_name #now make parent aware of children if not name in self.memory[parent_name]['children']: self.memory[parent_name]['children'].append(name) level = self.determine_level(parent_name) self.memory[name]['addresses'][i] = self.memory[name]['addresses'][i].replace(parent_name, '/'*level) #special case 'record' self.memory['record'] = {} self.memory['record']['object'] = None self.memory['record']['parent'] = '' self.memory['record']['children'] = [] def set_record_ids(self, recIDs, preprocess=None): ''' Initialize the translator with the set of record IDs. @param recIDs a list of the record IDs @param preprocess an optional function which acts on every record structure after creating it - This can be used to enrich the record with fields not present in the record initially, + This can be used to enrich the record with fields not present in the record initially, verify the record data or whatever plausible. Another solution is to use external function elements. ''' - self.record = None + self.record = None self.recIDs = recIDs self.preprocess = preprocess if self.recIDs: self.recID_index = 0 self.recID = self.recIDs[self.recID_index] self.record = get_record(self.recID) if self.preprocess: self.preprocess(self.record) - return None - + return None + def determine_level(self, name): ''' Determine the type of the variable, whether this is an instance or a subfield. This is done by observing the first provided address for the name. todo: define variable types in config file, remove this function, results in a clearer concept ''' level = 0 #default value if self.memory.has_key(name): expr = re.compile(address_pattern) if self.memory[name]['addresses']: match = expr.match(self.memory[name]['addresses'][0]) if match: - tag = match.group('tag') + tag = match.group('tag') code = match.group('code') reg = match.group('reg') if reg: level = 2 #subfield elif code: level = 2 #subfield elif tag: level = 1 #instance return level #======================================== #API functions for quering the translator #======================================== def is_defined(self, name): ''' Check whether a variable is defined. @param name the name of the variable ''' return self.memory.has_key(name) def get_num_elements(self, name): ''' An API function to get the number of elements for a variable. Do not use this function to build loops, Use iterator instead. ''' if name == 'record': return len(self.recIDs) num = 0 for part in self.iterator(name): num = num + 1 return num def get_value(self, name, display_type='value'): ''' The API function for quering the translator for values of a certain variable. Called in a loop will result in a different value each time. 
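A short usage sketch (added; record ID 1 is a placeholder, the label names are
those defined in bibformat_bfx_engine_config, and the imports at the top of this
module are assumed):

    trans = MARCTranslator(CFG_BIBFORMAT_BFX_LABEL_DEFINITIONS)
    trans.set_record_ids([1])
    trans.get_value('title')            # text of 245 $a
    trans.get_value('title', 'tag')     # -> '245'
    for _ in trans.iterator('author'):
        print trans.get_value('author.surname')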
Objects are cached in memory, so subsequent calls for the same variable take less time. @param name the name of the variable you want the value of @param display_type an optional value for the type of the desired output, one of: value, tag, ind1, ind2, code, fulltag; These can be easily added in the proper place of the code (display_value) ''' if name == 'record': return '' record = self.get_object(name) return self.display_record(record, display_type) - + def iterator(self, name): ''' An iterator over the values of a certain name. The iterator changes state of internal variables and objects. When calling get_value in a loop, this will result each time in a different value. ''' if name == 'record': for self.recID in self.recIDs: self.record = get_record(self.recID) if self.preprocess: self.preprocess(self.record) yield str(self.recID) else: full_object = self.build_object(name) level = self.determine_level(name) for new_object in record_parts(full_object, level): self.memory[name]['object'] = new_object #parent has changed state; also set childs state to None; for children_name in self.memory[name]['children']: self.memory[children_name]['object'] = None yield new_object #the result for a call of the same name after an iterator should be the same as if there was no iterator called before self.memory[name]['object'] = None - + def call_function(self, function_name, parameters=None): ''' Call an external element which is a Python file, using BibFormat @param function_name the name of the function to call @param parameters a dictionary of the parameters to pass as key=value pairs @return a string value, which is the result of the function call ''' if parameters is None: parameters = {} bfo = BibFormatObject(self.recID) format_element = get_format_element(function_name) (value, errors) = eval_format_element(format_element, bfo, parameters) #to do: check errors from function call return value - + #======================================== #end of API functions #======================================== def get_object(self, name): ''' Responsible for creating the desired object, corresponding to provided name. If object is not cached in memory, it is build again. Directly called by API function get_value. The result is then formatted by display_record according to display_type. ''' if self.memory[name]['object'] is not None: return self.memory[name]['object'] new_object = self.build_object(name) #if you have reached here you are not in an iterator; return first non-empty - level = self.determine_level(name) + level = self.determine_level(name) for tmp_object in record_parts(new_object, level): #get the first non-empty if tmp_object: new_object = tmp_object break self.memory[name]['object'] = new_object return new_object def build_object(self, name): ''' Build the object from the list of addresses A slave function for get_object. ''' new_object = {} - parent_name = self.memory[name]['parent']; + parent_name = self.memory[name]['parent']; has_parent = parent_name - for address in self.memory[name]['addresses']: + for address in self.memory[name]['addresses']: if not has_parent: tmp_object = copy(self.record, address) new_object = merge(new_object, tmp_object) else: #has parent parent_object = self.get_object(parent_name) #already returns the parents instance tmp_object = copy(parent_object, address) new_object = merge(new_object, tmp_object) return new_object - - + + def display_record(self, record, display_type='value'): ''' Decide what the final output value is according to the display_type. 
@param record the record structure to display; this is most probably just a single subfield @param display_type a string specifying the desired output; can be one of: value, tag, ind1, ind2, code, fulltag @return a string to output ''' output = '' tag, ind1, ind2, code, value = '', '', '', '', '' if record: tags = record.keys() tags.sort() if tags: fulltag = tags[0] tag, ind1, ind2 = fulltag[0:3], fulltag[3:4], fulltag[4:5] field_instances = record[fulltag] if field_instances: field_instance = field_instances[0] codes = field_instance.keys() codes.sort() if codes: code = codes[0] value = field_instance[code] if not display_type: display_type = 'value' if display_type == 'value': output = value elif display_type == 'tag': output = tag elif display_type == 'ind1': ind1 = ind1.replace('_', ' ') output = ind1 elif display_type=='ind2': ind2 = ind2.replace('_', ' ') output = ind2 elif display_type == 'code': output = code elif display_type == 'fulltag': output = tag + ind1 + ind2 else: print CFG_BIBFORMAT_BFX_ERROR_MESSAGES['ERR_BFX_INVALID_DISPLAY_TYPE'] % (display_type) - return output + return output ''' Functions for use with the structure representing a MARC record defined here. This record structure differs from the one defined in bibrecord. The reason is that we want a symmetry between controlfields and datafields. In this format controlfields are represented internally as a subfield value with code ' ' of a datafield. This allows for easier handling of the fields. -However, there is a restriction associated with this structure and it is that subfields cannot be repeated +However, there is a restriction associated with this structure and it is that subfields cannot be repeated in the same instance. If this is the case, the result will be incorrect. The record structure has the form: fields={field_tag:field_instances} field_instances=[field_instance] field_instance={field_code:field_value} ''' def convert_record(old_record): ''' Convert a record from the format defined in bibrecord to the format defined here @param old_record the record as returned from bibrecord.create_record() @return a record of the new form ''' fields = {} old_tags = old_record.keys() old_tags.sort() for old_tag in old_tags: if int(old_tag) < 11: - #controlfields + #controlfields new_tag = old_tag fields[new_tag] = [{' ':old_record[old_tag][0][3]}] else: #datafields old_field_instances = old_record[old_tag] num_fields = len(old_field_instances) for i in range(num_fields): old_field_instance = old_field_instances[i] ind1 = old_field_instance[1] if not ind1 or ind1 == ' ': ind1 = '_' ind2 = old_field_instance[2] if not ind2 or ind2 == ' ': ind2 = '_' new_tag = old_tag + ind1 + ind2 new_field_instance = {} for old_subfield in old_field_instance[0]: new_code = old_subfield[0] new_value = old_subfield[1] if new_field_instance.has_key(new_code): print 'Error: Repeating subfield codes in the same instance!' new_field_instance[new_code] = new_value if not fields.has_key(new_tag): fields[new_tag] = [] fields[new_tag].append(new_field_instance) return fields def get_record(recID): ''' Get a record with a specific recID. @param recID the ID of the record @return a record in the structure defined here ''' bfo = BibFormatObject(recID) return convert_record(bfo.get_record()) def print_record(record): ''' Print a record. 
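A worked example of the structure (added; field values invented): the
bibrecord-style input

    {'001': [([], ' ', ' ', '33', 1)],
     '100': [([('a', 'Doe, John'), ('b', 'editor')], ' ', ' ', '', 2)]}

comes out of convert_record() above as

    {'001': [{' ': '33'}],
     '100__': [{'a': 'Doe, John', 'b': 'editor'}]}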
''' tags = record.keys() tags.sort() for tag in tags: field_instances = record[tag] for field_instance in field_instances: print tag, field_instance def record_fields_value(record, tag, subfield): ''' Return a list of all the fields with a certain tag and subfield code. - Works on subfield level. + Works on subfield level. @param record a record @param tag a 3 or 5 letter tag; required @param subfield a subfield code; required ''' output = [] if record.has_key(tag): for field_instance in record[tag]: if field_instance.has_key(subfield): output.append(field_instance[subfield]) return output def record_add_field_instance(record, tag, field_instance): ''' Add a field_instance to the beginning of the instances of a corresponding tag. @param record a record @param tag a 3 or 5 letter tag; required @param field_instance the field instance to add @return None ''' if not record.has_key(tag): record[tag] = [] record[tag] = [field_instance] + record[tag] return None def record_num_parts(record, level): ''' Count the number of instances or the number of subfields in the whole record. @param record @param level either 1 or 2 level=1 - view record on instance level level=2 - view record on subfield level @return the number of parts ''' num = 0 for part in record_parts(record, level): num = num + 1 def record_parts(record, level): ''' An iterator over the instances or subfields of a record. @param record @param level either 1 or 2 level=1 - iterate over instances level=2 - iterate over subfields @yield a record structure representing the part (instance or subfield) - ''' + ''' if level == 1: names = record.keys() names.sort() for name in names: old_field_instances = record[name] for old_field_instance in old_field_instances: new_record = {} - new_field_instances = [] + new_field_instances = [] new_field_instance = {} for old_field_code in old_field_instance.keys(): new_field_code = old_field_code new_field_value = old_field_instance[old_field_code] new_field_instance[new_field_code] = new_field_value new_field_instances.append(new_field_instance) new_record[name] = [] new_record[name].extend(new_field_instances) - yield new_record + yield new_record if level == 2: names = record.keys() names.sort() for name in names: old_field_instances = record[name] for old_field_instance in old_field_instances: old_field_codes = old_field_instance.keys() old_field_codes.sort() for old_field_code in old_field_codes: new_record = {} new_field_instances = [] new_field_instance = {} new_field_code = old_field_code new_field_value = old_field_instance[old_field_code] new_field_instance[new_field_code] = new_field_value new_field_instances.append(new_field_instance) new_record[name] = [] new_record[name].extend(new_field_instances) - yield new_record + yield new_record def copy(old_record, address=''): ''' Copy a record by filtering all parts of the old record specified by address (A better name for the function is filter.) @param record the initial record @param address an address; for examples see bibformat_bfx_engine_config. If no address is specified, return the initial record. 
@return the filtered record ''' if not old_record: return {} tag_pattern, code_pattern, reg_pattern = '', '', '' expr = re.compile(address_pattern) match = expr.match(address) if match: tag_pattern = match.group('tag') code_pattern = match.group('code') reg_pattern = match.group('reg') if tag_pattern: tag_pattern = tag_pattern.replace('?','[0-9_\w]') else: tag_pattern = r'.*' if code_pattern: code_pattern = code_pattern.replace('?','[\w ]') else: - code_pattern = r'.*' + code_pattern = r'.*' tag_expr = re.compile(tag_pattern) code_expr = re.compile(code_pattern) new_record = {} for tag in old_record.keys(): tag_match = tag_expr.match(tag) if tag_match: if tag_match.end() == len(tag): old_field_instances = old_record[tag] new_field_instances = [] for old_field_instance in old_field_instances: new_field_instance = {} for old_field_code in old_field_instance.keys(): new_field_code = old_field_code code_match = code_expr.match(new_field_code) if code_match: new_field_value = old_field_instance[old_field_code] new_field_instance[new_field_code] = new_field_value - if new_field_instance: + if new_field_instance: new_field_instances.append(new_field_instance) if new_field_instances: new_record[tag] = new_field_instances #in new_record pass all subfields through regexp if reg_pattern: for tag in new_record: field_instances = new_record[tag] for field_instance in field_instances: field_codes = field_instance.keys() for field_code in field_codes: field_instance[field_code] = pass_through_regexp(field_instance[field_code], reg_pattern) return new_record def merge(record1, record2): ''' Merge two records. Controlfields with the same tag in record2 as in record1 are ignored. @param record1, record2 @return the merged record ''' new_record = {} if record1: new_record = copy(record1) if not record2: return new_record for tag in record2.keys(): #append only datafield tags; #if controlfields conflict, leave first; old_field_instances = record2[tag] new_field_instances = [] for old_field_instance in old_field_instances: new_field_instance = {} for old_field_code in old_field_instance.keys(): new_field_code = old_field_code new_field_value = old_field_instance[old_field_code] new_field_instance[new_field_code] = new_field_value - if new_field_instance: + if new_field_instance: new_field_instances.append(new_field_instance) if new_field_instances: #controlfield if len(tag) == 3: if not new_record.has_key(tag): new_record[tag] = [] new_record[tag].extend(new_field_instances) #datafield if len(tag) == 5: if not new_record.has_key(tag): new_record[tag] = [] new_record[tag].extend(new_field_instances) return new_record #====================== #Help functions #===================== xmlopen = 1 xmlclose = 2 xmlfull = 3 xmlempty = 4 def create_xml_element(name, value='', attrs=None, element_type=xmlfull, level=0): ''' Create a XML element as string. 
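For instance (added sketch of the expected output; values are written as given,
escaping is left to the callers via xml_escape()):

    create_xml_element('subfield', value='Doe, John', attrs={'code': 'a'})
    # -> '<subfield code="a">Doe, John</subfield>'
    create_xml_element('controlfield', attrs={'tag': '001'}, element_type=xmlempty)
    # -> '<controlfield tag="001" />'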
@param name the name of the element @param value the element value; default is '' @param attrs a dictionary with the element attributes @param element_type a constant which defines the type of the output xmlopen = 1 xmlclose = 2 xmlfull = 3 value xmlempty = 4 @return a formatted XML string ''' output = '' if attrs is None: attrs = {} - if element_type == xmlempty: + if element_type == xmlempty: output += '<'+name for attrname in attrs.keys(): attrvalue = attrs[attrname] if type(attrvalue) == type(u''): attrvalue = attrvalue.encode('utf-8') output += ' %s="%s"' % (attrname, attrvalue) output += ' />' if element_type == xmlfull: output += '<'+name for attrname in attrs.keys(): attrvalue = attrs[attrname] if type(attrvalue) == type(u''): attrvalue = attrvalue.encode('utf-8') output += ' %s="%s"' % (attrname, attrvalue) output += '>' output += value output += '' if element_type == xmlopen: output += '<'+name for attrname in attrs.keys(): output += ' '+attrname+'="'+attrs[attrname]+'"' output += '>' if element_type == xmlclose: output += '' output = ' '*level + output if type(output) == type(u''): output = output.encode('utf-8') return output def xml_escape(value): ''' Escape a string value for use as a xml element or attribute value. @param value the string value to escape @return escaped value ''' return saxutils.escape(value) def xml_unescape(value): ''' Unescape a string value for use as a xml element. @param value the string value to unescape @return unescaped value ''' return saxutils.unescape(value) - + def node_has_subelements(node): ''' Check if a node has any childnodes. Check for element or text nodes. @return True if childnodes exist, False otherwise. ''' result = False for node in node.childNodes: if node.nodeType == Node.ELEMENT_NODE or node.nodeType == Node.TEXT_NODE: result = True return result def get_node_subelement(parent_node, name, namespace = None): ''' Get the first childnode with specific name and (optional) namespace @param parent_node the node to check @param name the name to search @param namespace An optional namespace URI. This is usually a URL: http://cdsware.cern.ch/invenio/ @return the found node; None otherwise ''' output = None for node in parent_node.childNodes: if node.nodeType == Node.ELEMENT_NODE and node.localName == name and node.namespaceURI == namespace: output = node return output return output def get_node_value(node): ''' Get the node value of a node. For use with text nodes. @param node a text node @return a string of the nodevalue encoded in utf-8 ''' return node.nodeValue.encode('utf-8') def get_node_namespace(node): ''' Get node namespace. For use with element nodes. @param node an element node @return the namespace of the node ''' return node.namespaceURI def get_node_name(node): ''' Get the node value of a node. For use with element nodes. @param node an element node @return a string of the node name - ''' + ''' return node.nodeName def get_node_attributes(node): ''' Get attributes of an element node. For use with element nodes @param node an element node @return a dictionary of the attributes as key:value pairs ''' attributes = {} attrs = node.attributes for attrname in attrs.keys(): attrnode = attrs.get(attrname) attrvalue = attrnode.nodeValue attributes[attrname] = attrvalue return attributes def pass_through_regexp(value, regexp): ''' Pass a value through a regular expression. @param value a string @param regexp a regexp with a group 'value' in it. No group named 'value' will result in an error. 
@return if the string matches the regexp, return named group 'value', otherwise return '' ''' output = '' expr = re.compile(regexp) match = expr.match(value) if match: output = match.group('value') return output def is_number(value): ''' Check if a value is a number. @param value the value to check @return True or False ''' result = True try: float(value) except ValueError: result = False return result - + diff --git a/modules/bibformat/lib/bibformat_bfx_engine_config.py b/modules/bibformat/lib/bibformat_bfx_engine_config.py index 56c7828e5..6cab88bdf 100644 --- a/modules/bibformat/lib/bibformat_bfx_engine_config.py +++ b/modules/bibformat/lib/bibformat_bfx_engine_config.py @@ -1,117 +1,117 @@ ## $Id$ ## ## This file is part of CDS Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 CERN. ## ## CDS Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## CDS Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -## General Public License for more details. +## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with CDS Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. # pylint: disable-msg=C0301 """BibFormat BFX engine configuration.""" __revision__ = "$Id$" import os -from invenio.config import etcdir +from invenio.config import CFG_ETCDIR -CFG_BIBFORMAT_BFX_TEMPLATES_PATH = "%s%sbibformat%sformat_templates" % (etcdir, os.sep, os.sep) +CFG_BIBFORMAT_BFX_TEMPLATES_PATH = "%s%sbibformat%sformat_templates" % (CFG_ETCDIR, os.sep, os.sep) CFG_BIBFORMAT_BFX_FORMAT_TEMPLATE_EXTENSION = "bfx" CFG_BIBFORMAT_BFX_ELEMENT_NAMESPACE = "http://cdsware.cern.ch/invenio/" CFG_BIBFORMAT_BFX_LABEL_DEFINITIONS = { #record is a reserved keyword, don't use it #define one or more addresses for each name or zero if you plan to define them later 'controlfield': [r'/???'], 'datafield': [r'/?????'], 'datafield.subfield': [r'datafield/?'], 'recid': [r'/001'], 'article_id': [], 'language': [r'/041__/a'], 'title': [r'/245__/a'], 'subtitle': [r'/245__/b'], 'secondary_title': [r'/773__/p'], 'first_author': [r'/100__/a'], 'author': [r'/100__/a', r'/700__/a'], 'author.surname': [r'author#(?P.*),[ ]*(.*)'], 'author.names': [r'author#(.*),[ ]*(?P.*)'], 'abstract': [r'/520__/a'], 'publisher': [r'/260__/b'], 'publisher_location': [r'/260__/a'], 'issn': [r'/022__/a'], 'doi': [r'/773__/a'], 'journal_name_long': [r'/222__/a', r'/210__/a', r'/773__/p', r'/909C4/p'], 'journal_name_short': [r'/210__/a', r'/773__/p', r'/909C4/p'], 'journal_name': [r'/773__/p', r'/909C4/p'], 'journal_volume': [r'/773__/v', r'/909C4/v'], 'journal_issue': [r'/773__/n'], 'pages': [r'/773__/c', r'/909C4/c'], 'first_page': [r'/773__/c#(?P\d*)-(\d*)', r'/909C4/c#(?P\d*)-(\d*)'], 'last_page': [r'/773__/c#(\d*)-(?P\d*)', r'/909C4/c#(\d*)-(?P\d*)'], 'date': [r'/260__/c'], 'year': [r'/773__/y#(.*)(?P\d\d\d\d).*', r'/260__/c#(.*)(?P\d\d\d\d).*', r'/925__/a#(.*)(?P\d\d\d\d).*', r'/909C4/y'], 'doc_type': [r'/980__/a'], 'doc_status': [r'/980__/c'], 'uri': [r'/8564_/u', r'/8564_/q'], 'subject': [r'/65017/a'], 'keyword': [r'/6531_/a'], 'day': [], 'month': [], 'creation_date': [], 
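    # Illustrative reading of the address syntax used above (sample values
    # assumed): '/700__/a' selects subfield 'a' of every 700__ datafield;
    # '?' is a single-character wildcard, so '/???' matches any controlfield
    # tag; a trailing '#regexp' part passes the selected value through a
    # regular expression whose named group 'value' is kept, e.g. an address
    # like '/773__/c#(?P<value>\d*)-(\d*)' would extract the first page
    # from a page range such as '215-242'.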
'reference': [] } CFG_BIBFORMAT_BFX_ERROR_MESSAGES = \ { 'ERR_BFX_TEMPLATE_REF_NO_NAME' : 'Error: Missing attribute "name" in TEMPLATE_REF.', 'ERR_BFX_TEMPLATE_NOT_FOUND' : 'Error: Template %s not found.', 'ERR_BFX_ELEMENT_NO_NAME' : 'Error: Missing attribute "name" in ELEMENT.', 'ERR_BFX_FIELD_NO_NAME' : 'Error: Missing attribute "name" in FIELD.', 'ERR_BFX_LOOP_NO_OBJECT' : 'Error: Missing attribute "object" in LOOP.', 'ERR_BFX_NO_SUCH_FIELD' : 'Error: Field %s is not defined', 'ERR_BFX_IF_NO_NAME' : 'Error: Missing attrbute "name" in IF.', 'ERR_BFX_TEXT_NO_VALUE' : 'Error: Missing attribute "value" in TEXT.', 'ERR_BFX_INVALID_RE' : 'Error: Invalid regular expression: %s', 'ERR_BFX_INVALID_OPERATOR_NAME' : 'Error: Name %s is not recognised as a valid operator name.', 'ERR_BFX_INVALID_DISPLAY_TYPE' : 'Error: Invalid display type. Must be one of: value, tag, ind1, ind2, code; received: %s', 'ERR_BFX_IF_WRONG_SYNTAX' : 'Error: Invalid syntax of IF statement.', 'ERR_BFX_DUPLICATE_NAME' : 'Error: Duplicate name: %s.', 'ERR_BFX_TEMPLATE_NO_NAME' : 'Error: No name defined for the template.', 'ERR_BFX_NO_TEMPLATES_FOUND' : 'Error: No templates found in the document.', 'ERR_BFX_TOO_MANY_TEMPLATES' : 'Error: More than one templates found in the document. No format found.' } CFG_BIBFORMAT_BFX_WARNING_MESSAGES = \ { 'WRN_BFX_TEMPLATE_NO_DESCRIPTION' : 'Warning: No description entered for the template.', 'WRN_BFX_TEMPLATE_NO_CONTENT' : 'Warning: No content type specified for the template. Using default: text/xml.', 'WRN_BFX_NO_FORMAT_FOUND' : 'Warning: No format found. Will look for a default template.' } diff --git a/modules/bibformat/lib/bibformat_config.py b/modules/bibformat/lib/bibformat_config.py index 39d99b050..caffdb43b 100644 --- a/modules/bibformat/lib/bibformat_config.py +++ b/modules/bibformat/lib/bibformat_config.py @@ -1,97 +1,97 @@ # -*- coding: utf-8 -*- ## ## $Id$ ## ## This file is part of CDS Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 CERN. ## ## CDS Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## CDS Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with CDS Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. # pylint: disable-msg=C0301 """BibFormat configuration parameters.""" __revision__ = "$Id$" import os -from invenio.config import etcdir, pylibdir +from invenio.config import CFG_ETCDIR, CFG_PYLIBDIR #True if old php format written in EL must be used by Invenio. #False if new python format must be used. If set to 'False' but #new format cannot be found, old format will be used. 
CFG_BIBFORMAT_USE_OLD_BIBFORMAT = False #Paths to main formats directories -CFG_BIBFORMAT_TEMPLATES_PATH = "%s%sbibformat%sformat_templates" % (etcdir, os.sep, os.sep) +CFG_BIBFORMAT_TEMPLATES_PATH = "%s%sbibformat%sformat_templates" % (CFG_ETCDIR, os.sep, os.sep) CFG_BIBFORMAT_ELEMENTS_IMPORT_PATH = "invenio.bibformat_elements" -CFG_BIBFORMAT_ELEMENTS_PATH = "%s%sinvenio%sbibformat_elements" % (pylibdir, os.sep, os.sep) -CFG_BIBFORMAT_OUTPUTS_PATH = "%s%sbibformat%soutput_formats" % (etcdir, os.sep, os.sep) +CFG_BIBFORMAT_ELEMENTS_PATH = "%s%sinvenio%sbibformat_elements" % (CFG_PYLIBDIR, os.sep, os.sep) +CFG_BIBFORMAT_OUTPUTS_PATH = "%s%sbibformat%soutput_formats" % (CFG_ETCDIR, os.sep, os.sep) #File extensions of formats CFG_BIBFORMAT_FORMAT_TEMPLATE_EXTENSION = "bft" CFG_BIBFORMAT_FORMAT_OUTPUT_EXTENSION = "bfo" CFG_BIBFORMAT_ERROR_MESSAGES = \ { 'ERR_BIBFORMAT_INVALID_TAG' : '%s is an invalid tag.', 'ERR_BIBFORMAT_NO_TEMPLATE_FOUND' : 'No template could be found for output format %s.', 'ERR_BIBFORMAT_CANNOT_RESOLVE_ELEMENT_NAME' : 'Could not find format element named %s.', 'ERR_BIBFORMAT_CANNOT_RESOLVE_OUTPUT_NAME' : 'Could not find output format named %s.', 'ERR_BIBFORMAT_CANNOT_RESOLVE_TEMPLATE_FILE' : 'Could not find format template named %s.', 'ERR_BIBFORMAT_FORMAT_ELEMENT_NOT_FOUND' : 'Format element %s could not be found.', 'ERR_BIBFORMAT_BAD_BFO_RECORD' : 'Could not initialize new BibFormatObject with record id %s.', 'ERR_BIBFORMAT_NB_OUTPUTS_LIMIT_REACHED' : 'Could not find a fresh name for output format %s.', 'ERR_BIBFORMAT_KB_ID_UNKNOWN' : 'Knowledge base with id %s is unknown.', 'ERR_BIBFORMAT_OUTPUT_FORMAT_CODE_UNKNOWN' : 'Output format with code %s could not be found.', 'ERR_BIBFORMAT_CANNOT_READ_TEMPLATE_FILE' : 'Format template %s cannot not be read. %s', 'ERR_BIBFORMAT_CANNOT_WRITE_TEMPLATE_FILE' : 'BibFormat could not write to format template %s. %s', 'ERR_BIBFORMAT_CANNOT_READ_OUTPUT_FILE' : 'Output format %s cannot not be read. %s', 'ERR_BIBFORMAT_CANNOT_WRITE_OUTPUT_FILE' : 'BibFormat could not write to output format %s. %s', 'ERR_BIBFORMAT_EVALUATING_ELEMENT' : 'Error when evaluating format element %s with parameters %s', 'ERR_BIBFORMAT_CANNOT_READ_ELEMENT_FILE' : 'Format element %s cannot not be read. 
%s', 'ERR_BIBFORMAT_INVALID_OUTPUT_RULE_FIELD' : 'Should be "tag field_number:" at line %s.', 'ERR_BIBFORMAT_INVALID_OUTPUT_RULE_FIELD_TAG' : 'Invalid tag "%s" at line %s.', 'ERR_BIBFORMAT_OUTPUT_CONDITION_OUTSIDE_FIELD': 'Condition "%s" is outside a tag specification at line %s.', 'ERR_BIBFORMAT_INVALID_OUTPUT_CONDITION' : 'Condition "%s" can only have a single separator --- at line %s.', 'ERR_BIBFORMAT_WRONG_OUTPUT_RULE_TEMPLATE_REF': 'Template "%s" does not exist at line %s.', 'ERR_BIBFORMAT_WRONG_OUTPUT_LINE' : 'Line %s could not be understood at line %s.', 'ERR_BIBFORMAT_OUTPUT_WRONG_TAG_CASE' : '"tag" must be lowercase in "%s" at line %s.', 'ERR_BIBFORMAT_OUTPUT_RULE_FIELD_COL' : 'Tag specification "%s" must end with column ":" at line %s.', 'ERR_BIBFORMAT_OUTPUT_TAG_MISSING' : 'Tag specification "%s" must start with "tag" at line %s.', 'ERR_BIBFORMAT_OUTPUT_WRONG_DEFAULT_CASE' : '"default" keyword must be lowercase in "%s" at line %s', 'ERR_BIBFORMAT_OUTPUT_RULE_DEFAULT_COL' : 'Missing column ":" after "default" in "%s" at line %s.', 'ERR_BIBFORMAT_OUTPUT_DEFAULT_MISSING' : 'Default template specification "%s" must start with "default :" at line %s.', 'ERR_BIBFORMAT_FORMAT_ELEMENT_FORMAT_FUNCTION': 'Format element %s has no function named "format".', 'ERR_BIBFORMAT_VALIDATE_NO_FORMAT' : 'No format specified for validation. Please specify one.', 'ERR_BIBFORMAT_TEMPLATE_HAS_NO_NAME' : 'Could not find a name specified in tag "" inside format template %s.', 'ERR_BIBFORMAT_TEMPLATE_HAS_NO_DESCRIPTION' : 'Could not find a description specified in tag "" inside format template %s.', 'ERR_BIBFORMAT_TEMPLATE_CALLS_UNREADABLE_ELEM': 'Format template %s calls unreadable element "%s". Check element file permissions.', 'ERR_BIBFORMAT_TEMPLATE_CALLS_UNLOADABLE_ELEM': 'Cannot load element "%s" in template %s. Check element code.', 'ERR_BIBFORMAT_TEMPLATE_CALLS_UNDEFINED_ELEM' : 'Format template %s calls undefined element "%s".', 'ERR_BIBFORMAT_TEMPLATE_WRONG_ELEM_ARG' : 'Format element %s uses unknown parameter "%s" in format template %s.', 'ERR_BIBFORMAT_IN_FORMAT_ELEMENT' : 'Error in format element %s. %s', 'ERR_BIBFORMAT_NO_RECORD_FOUND_FOR_PATTERN' : 'No Record Found for %s.', 'ERR_BIBFORMAT_NBMAX_NOT_INT' : '"nbMax" parameter for %s must be an "int".', 'ERR_BIBFORMAT_EVALUATING_ELEMENT_ESCAPE' : 'Escape mode for format element %s could not be retrieved. Using default mode instead.' } CFG_BIBFORMAT_WARNING_MESSAGES = \ { 'WRN_BIBFORMAT_OUTPUT_FORMAT_NAME_TOO_LONG' : 'Name %s is too long for output format %s in language %s. Truncated to first 256 characters.', 'WRN_BIBFORMAT_KB_NAME_UNKNOWN' : 'Cannot find knowledge base named %s.', 'WRN_BIBFORMAT_KB_MAPPING_UNKNOWN' : 'Cannot find a mapping with key %s in knowledge base %s.', 'WRN_BIBFORMAT_CANNOT_WRITE_IN_ETC_BIBFORMAT' : 'Cannot write in etc/bibformat dir of your Invenio installation. Check directory permission.', 'WRN_BIBFORMAT_CANNOT_WRITE_MIGRATION_STATUS' : 'Cannot write file migration_status.txt in etc/bibformat dir of your Invenio installation. Check file permission.', 'WRN_BIBFORMAT_CANNOT_EXECUTE_REQUEST' : 'Your request could not be executed.' } diff --git a/modules/bibformat/lib/bibformat_engine.py b/modules/bibformat/lib/bibformat_engine.py index 0668c8e59..2ccc2e37b 100644 --- a/modules/bibformat/lib/bibformat_engine.py +++ b/modules/bibformat/lib/bibformat_engine.py @@ -1,2008 +1,2008 @@ # -*- coding: utf-8 -*- ## ## $Id$ ## ## This file is part of CDS Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 CERN. 
## ## CDS Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## CDS Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with CDS Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. """ Formats a single XML Marc record using specified format. There is no API for the engine. Instead use bibformat.py. SEE: bibformat.py, bibformat_utils.py """ __revision__ = "$Id$" import re import sys import os import inspect import traceback import zlib import cgi from invenio.config import \ CFG_PATH_PHP, \ - bindir, \ + CFG_BINDIR, \ cdslang from invenio.errorlib import \ register_errors, \ get_msgs_for_code_list from invenio.bibrecord import \ create_record, \ record_get_field_instances, \ record_get_field_value, \ record_get_field_values from invenio.bibformat_xslt_engine import format from invenio.dbquery import run_sql from invenio.messages import \ language_list_long, \ wash_language, \ gettext_set_language from invenio import bibformat_dblayer from invenio.bibformat_config import \ CFG_BIBFORMAT_FORMAT_TEMPLATE_EXTENSION, \ CFG_BIBFORMAT_FORMAT_OUTPUT_EXTENSION, \ CFG_BIBFORMAT_TEMPLATES_PATH, \ CFG_BIBFORMAT_ELEMENTS_PATH, \ CFG_BIBFORMAT_OUTPUTS_PATH, \ CFG_BIBFORMAT_ELEMENTS_IMPORT_PATH from invenio.bibformat_utils import \ record_get_xml, \ parse_tag from invenio.htmlutils import HTMLWasher from invenio.webuser import collect_user_info if CFG_PATH_PHP: #Remove when call_old_bibformat is removed from xml.dom import minidom import tempfile # Cache for data we have already read and parsed format_templates_cache = {} format_elements_cache = {} format_outputs_cache = {} kb_mappings_cache = {} cdslangs = language_list_long() html_field = '' # String indicating that field should be # treated as HTML (and therefore no escaping of # HTML tags should occur. # Appears in some field values. washer = HTMLWasher() # Used to remove dangerous tags from HTML # sources # Regular expression for finding ... tag in format templates pattern_lang = re.compile(r''' #closing start tag (?P.*?) #anything but the next group (greedy) () #end tag ''', re.IGNORECASE | re.DOTALL | re.VERBOSE) # Builds regular expression for finding each known language in tags ln_pattern_text = r"<(" for lang in cdslangs: ln_pattern_text += lang[0] +r"|" ln_pattern_text = ln_pattern_text.rstrip(r"|") ln_pattern_text += r")>(.*?)" ln_pattern = re.compile(ln_pattern_text, re.IGNORECASE | re.DOTALL) # Regular expression for finding text to be translated translation_pattern = re.compile(r'_\((?P.*?)\)_', \ re.IGNORECASE | re.DOTALL | re.VERBOSE) # Regular expression for finding tag in format templates pattern_format_template_name = re.compile(r''' #closing start tag (?P.*?) #name value. any char that is not end tag ()(\n)? #end tag ''', re.IGNORECASE | re.DOTALL | re.VERBOSE) # Regular expression for finding tag in format templates pattern_format_template_desc = re.compile(r''' #closing start tag (?P.*?) #description value. any char that is not end tag (\n)? 
#end tag ''', re.IGNORECASE | re.DOTALL | re.VERBOSE) # Regular expression for finding tags in format templates pattern_tag = re.compile(r''' [^/\s]+) #any char but a space or slash \s* #any number of spaces (?P(\s* #params here (?P([^=\s])*)\s* #param name: any chars that is not a white space or equality. Followed by space(s) =\s* #equality: = followed by any number of spaces (?P[\'"]) #one of the separators (?P.*?) #param value: any chars that is not a separator like previous one (?P=sep) #same separator as starting one )*) #many params \s* #any number of spaces (/)?> #end of the tag ''', re.IGNORECASE | re.DOTALL | re.VERBOSE) # Regular expression for finding params inside tags in format templates pattern_function_params = re.compile(''' (?P([^=\s])*)\s* # Param name: any chars that is not a white space or equality. Followed by space(s) =\s* # Equality: = followed by any number of spaces (?P[\'"]) # One of the separators (?P.*?) # Param value: any chars that is not a separator like previous one (?P=sep) # Same separator as starting one ''', re.VERBOSE | re.DOTALL ) # Regular expression for finding format elements "params" attributes # (defined by @param) pattern_format_element_params = re.compile(''' @param\s* # Begins with @param keyword followed by space(s) (?P[^\s=]*)\s* # A single keyword, and then space(s) #(=\s*(?P[\'"]) # Equality, space(s) and then one of the separators #(?P.*?) # Default value: any chars that is not a separator like previous one #(?P=sep) # Same separator as starting one #)?\s* # Default value for param is optional. Followed by space(s) (?P.*) # Any text that is not end of line (thanks to MULTILINE parameter) ''', re.VERBOSE | re.MULTILINE) # Regular expression for finding format elements "see also" attribute # (defined by @see) pattern_format_element_seealso = re.compile('''@see\s*(?P.*)''', re.VERBOSE | re.MULTILINE) #Regular expression for finding 2 expressions in quotes, separated by #comma (as in template("1st","2nd") ) #Used when parsing output formats ## pattern_parse_tuple_in_quotes = re.compile(''' ## (?P[\'"]) ## (?P.*) ## (?P=sep1) ## \s*,\s* ## (?P[\'"]) ## (?P.*) ## (?P=sep2) ## ''', re.VERBOSE | re.MULTILINE) def call_old_bibformat(recID, format="HD", on_the_fly=False, verbose=0): """ FIXME: REMOVE FUNCTION WHEN MIGRATION IS DONE Calls BibFormat for the record RECID in the desired output format FORMAT. @param on_the_fly if False, try to return an already preformatted version of the record in the database Note: this functions always try to return HTML, so when bibformat returns XML with embedded HTML format inside the tag FMT $g, as is suitable for prestoring output formats, we perform un-XML-izing here in order to return HTML body only. """ out = "" res = [] if not on_the_fly: # look for formatted notice existence: query = "SELECT value, last_updated FROM bibfmt WHERE "\ "id_bibrec='%s' AND format='%s'" % (recID, format) res = run_sql(query, None, 1) if res: # record 'recID' is formatted in 'format', so print it if verbose == 9: last_updated = res[0][1] out += """\n
Found preformatted output for record %i (cache updated on %s). """ % (recID, last_updated) decompress = zlib.decompress return "%s" % decompress(res[0][0]) else: # record 'recID' is not formatted in 'format', # so try to call BibFormat on the fly or use default format: if verbose == 9: out += """\n
Formatting record %i on-the-fly with old BibFormat.
""" % recID # Retrieve MARCXML # Build it on-the-fly only if 'call_old_bibformat' was called # with format=xm and on_the_fly=True xm_record = record_get_xml(recID, 'xm', on_the_fly=(on_the_fly and format == 'xm')) ## import platform ## # Some problem have been found using either popen or os.system command. ## # Here is a temporary workaround until the issue is solved. ## if platform.python_compiler().find('Red Hat') > -1: ## # use os.system ## (result_code, result_path) = tempfile.mkstemp() -## command = "( %s/bibformat otype=%s ) > %s" % (bindir, format, result_path) +## command = "( %s/bibformat otype=%s ) > %s" % (CFG_BINDIR, format, result_path) ## (xm_code, xm_path) = tempfile.mkstemp() ## xm_file = open(xm_path, "w") ## xm_file.write(xm_record) ## xm_file.close() ## command = command + " <" + xm_path ## os.system(command) ## result_file = open(result_path,"r") ## bibformat_output = result_file.read() ## result_file.close() ## os.remove(result_path) ## os.remove(xm_path) ## else: ## # use popen - pipe_input, pipe_output, pipe_error = os.popen3(["%s/bibformat" % bindir, + pipe_input, pipe_output, pipe_error = os.popen3(["%s/bibformat" % CFG_BINDIR, "otype=%s" % format], 'rw') pipe_input.write(xm_record) pipe_input.flush() pipe_input.close() bibformat_output = pipe_output.read() pipe_output.close() pipe_error.close() if bibformat_output.startswith(""): dom = minidom.parseString(bibformat_output) for e in dom.getElementsByTagName('subfield'): if e.getAttribute('code') == 'g': for t in e.childNodes: out += t.data.encode('utf-8') else: out += bibformat_output return out def format_record(recID, of, ln=cdslang, verbose=0, search_pattern=[], xml_record=None, user_info=None): """ Formats a record given output format. Main entry function of bibformat engine. Returns a formatted version of the record in the specified language, search pattern, and with the specified output format. The function will define which format template must be applied. You can either specify an record ID to format, or give its xml representation. if 'xml_record' is not None, then use it instead of recID. 'user_info' allows to grant access to some functionalities on a page depending on the user's priviledges. 'user_info' is the same object as the one returned by 'webuser.collect_user_info(req)' @param recID the ID of record to format @param of an output format code (or short identifier for the output format) @param ln the language to use to format the record @param verbose the level of verbosity from 0 to 9 (O: silent, 5: errors, 7: errors and warnings, stop if error in format elements 9: errors and warnings, stop if error (debug mode )) @param search_pattern list of strings representing the user request in web interface @param xml_record an xml string representing the record to format @param user_info the information of the user who will view the formatted page @return formatted record """ out = "" errors_ = [] # Temporary workflow (during migration of formats): # Call new BibFormat # But if format not found for new BibFormat, then call old BibFormat #Create a BibFormat Object to pass that contain record and context bfo = BibFormatObject(recID, ln, search_pattern, xml_record, user_info, of) #Find out which format template to use based on record and output format. template = decide_format_template(bfo, of) if verbose == 9 and template is not None: out += """\n
Using %s template for record %i. """ % (template, recID) ############### FIXME: REMOVE WHEN MIGRATION IS DONE ############### path = "%s%s%s" % (CFG_BIBFORMAT_TEMPLATES_PATH, os.sep, template) if template is None or not os.access(path, os.R_OK): # template not found in new BibFormat. Call old one if verbose == 9: if template is None: out += """\n
No template found for output format %s and record %i. (Check invenio.err log file for more details) """ % (of, recID) else: out += """\n
Template %s could not be read. """ % (template) if CFG_PATH_PHP: if verbose == 9: out += """\n
Using old BibFormat for record %s. """ % recID return out + call_old_bibformat(recID, format=of, on_the_fly=True, verbose=verbose) ############################# END ################################## error = get_msgs_for_code_list([("ERR_BIBFORMAT_NO_TEMPLATE_FOUND", of)], stream='error', ln=cdslang) errors_.append(error) if verbose == 0: register_errors(error, 'error') elif verbose > 5: return out + error[0][1] return out # Format with template (out_, errors) = format_with_format_template(template, bfo, verbose) errors_.extend(errors) out += out_ return out def decide_format_template(bfo, of): """ Returns the format template name that should be used for formatting given output format and BibFormatObject. Look at of rules, and take the first matching one. If no rule matches, returns None To match we ignore lettercase and spaces before and after value of rule and value of record @param bfo a BibFormatObject @param of the code of the output format to use """ output_format = get_output_format(of) for rule in output_format['rules']: value = bfo.field(rule['field']).strip()#Remove spaces pattern = rule['value'].strip() #Remove spaces match_obj = re.match(pattern, value, re.IGNORECASE) if match_obj is not None and \ match_obj.start() == 0 and match_obj.end() == len(value): return rule['template'] template = output_format['default'] if template != '': return template else: return None def format_with_format_template(format_template_filename, bfo, verbose=0, format_template_code=None): """ Format a record given a format template. Also returns errors Returns a formatted version of the record represented by bfo, in the language specified in bfo, and with the specified format template. If format_template_code is provided, the template will not be loaded from format_template_filename (but format_template_filename will still be used to determine if bft or xsl transformation applies). This allows to preview format code without having to save file on disk. @param format_template_filename the dilename of a format template @param bfo the object containing parameters for the current formatting @param format_template_code if not empty, use code as template instead of reading format_template_filename (used for previews) @param verbose the level of verbosity from 0 to 9 (O: silent, 5: errors, 7: errors and warnings, 9: errors and warnings, stop if error (debug mode )) @return tuple (formatted text, errors) """ _ = gettext_set_language(bfo.lang) def translate(match): """ Translate matching values """ word = match.group("word") translated_word = _(word) return translated_word errors_ = [] if format_template_code is not None: format_content = str(format_template_code) else: format_content = get_format_template(format_template_filename)['code'] if format_template_filename is None or \ format_template_filename.endswith("."+CFG_BIBFORMAT_FORMAT_TEMPLATE_EXTENSION): # .bft filtered_format = filter_languages(format_content, bfo.lang) localized_format = translation_pattern.sub(translate, filtered_format) (evaluated_format, errors) = eval_format_template_elements(localized_format, bfo, verbose) errors_ = errors else: #.xsl # Fetch MARCXML. 
On-the-fly xm if we are now formatting in xm xml_record = record_get_xml(bfo.recID, 'xm', on_the_fly=(bfo.format != 'xm')) # Transform MARCXML using stylesheet evaluated_format = format(xml_record, template_source=format_content) return (evaluated_format, errors_) def eval_format_template_elements(format_template, bfo, verbose=0): """ Evalutes the format elements of the given template and replace each element with its value. Also returns errors. Prepare the format template content so that we can directly replace the marc code by their value. This implies: 1) Look for special tags 2) replace special tags by their evaluation @param format_template the format template code @param bfo the object containing parameters for the current formatting @param verbose the level of verbosity from 0 to 9 (O: silent, 5: errors, 7: errors and warnings, 9: errors and warnings, stop if error (debug mode )) @return tuple (result, errors) """ errors_ = [] # First define insert_element_code(match), used in re.sub() function def insert_element_code(match): """ Analyses 'match', interpret the corresponding code, and return the result of the evaluation. Called by substitution in 'eval_format_template_elements(...)' @param match a match object corresponding to the special tag that must be interpreted """ function_name = match.group("function_name") try: format_element = get_format_element(function_name, verbose) except Exception, e: if verbose >= 5: return '' + \ cgi.escape(str(e)).replace('\n', '
') + \ '
' if format_element is None: error = get_msgs_for_code_list([("ERR_BIBFORMAT_CANNOT_RESOLVE_ELEMENT_NAME", function_name)], stream='error', ln=cdslang) errors_.append(error) if verbose >= 5: return '' + \ error[0][1]+'' else: params = {} # Look for function parameters given in format template code all_params = match.group('params') if all_params is not None: function_params_iterator = pattern_function_params.finditer(all_params) for param_match in function_params_iterator: name = param_match.group('param') value = param_match.group('value') params[name] = value # Evaluate element with params and return (Do not return errors) (result, errors) = eval_format_element(format_element, bfo, params, verbose) errors_.append(errors) return result # Substitute special tags in the format by our own text. # Special tags have the form format = pattern_tag.sub(insert_element_code, format_template) return (format, errors_) def eval_format_element(format_element, bfo, parameters={}, verbose=0): """ Returns the result of the evaluation of the given format element name, with given BibFormatObject and parameters. Also returns the errors of the evaluation. @param format_element a format element structure as returned by get_format_element @param bfo a BibFormatObject used for formatting @param parameters a dict of parameters to be used for formatting. Key is parameter and value is value of parameter @param verbose the level of verbosity from 0 to 9 (O: silent, 5: errors, 7: errors and warnings, 9: errors and warnings, stop if error (debug mode )) @return tuple (result, errors) """ errors = [] #Load special values given as parameters prefix = parameters.get('prefix', "") suffix = parameters.get('suffix', "") default_value = parameters.get('default', "") escape = parameters.get('escape', "") output_text = '' # 3 possible cases: # a) format element file is found: we execute it # b) format element file is not found, but exist in tag table (e.g. bfe_isbn) # c) format element is totally unknown. Do nothing or report error if format_element is not None and format_element['type'] == "python": # a) We found an element with the tag name, of type "python" # Prepare a dict 'params' to pass as parameter to 'format' # function of element params = {} # Look for parameters defined in format element # Fill them with specified default values and values # given as parameters for param in format_element['attrs']['params']: name = param['name'] default = param['default'] params[name] = parameters.get(name, default) # Add BibFormatObject params['bfo'] = bfo # Execute function with given parameters and return result. 
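        # Illustrative state at this point (element and parameter names are
        # assumed): for a template tag such as <BFE_TITLE highlight="yes" />,
        # 'params' holds the element's own declared parameters filled with the
        # template-supplied values or their signature defaults, plus the
        # BibFormatObject, e.g. {'highlight': 'yes', 'bfo': bfo}.
        # Engine-level parameters (prefix, suffix, default, escape) were read
        # separately above and are not passed to the element's format().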
function = format_element['code'] try: output_text = apply(function, (), params) except Exception, e: name = format_element['attrs']['name'] error = ("ERR_BIBFORMAT_EVALUATING_ELEMENT", name, str(params)) errors.append(error) if verbose == 0: register_errors(errors, 'error') elif verbose >= 5: tb = sys.exc_info()[2] error_string = get_msgs_for_code_list(error, stream='error', ln=cdslang) stack = traceback.format_exception(Exception, e, tb, limit=None) output_text = ''+ \ str(error_string[0][1]) + "".join(stack) +' ' # None can be returned when evaluating function if output_text is None: output_text = "" else: output_text = str(output_text) # Escaping: # (1) By default, everything is escaped in mode 1 # (2) If evaluated element has 'escape_values()' function, use # its returned value as escape mode, and override (1) # (3) If template has a defined parameter (in allowed values), # use it, and override (1) and (2) # (1) escape_mode = 1 # (2) escape_function = format_element['escape_function'] if escape_function is not None: try: escape_mode = apply(escape_function, (), {'bfo': bfo}) except Exception, e: error = ("ERR_BIBFORMAT_EVALUATING_ELEMENT_ESCAPE", name) errors.append(error) if verbose == 0: register_errors(errors, 'error') elif verbose >= 5: tb = sys.exc_info()[2] error_string = get_msgs_for_code_list(error, stream='error', ln=cdslang) output_text += ''+ \ str(error_string[0][1]) +' ' # (3) if escape in ['0', '1', '2', '3', '4']: escape_mode = int(escape) #If escape is equal to 1, then escape all # HTML reserved chars. if escape_mode > 0: output_text = escape_field(output_text, mode=escape_mode) # Add prefix and suffix if they have been given as parameters and if # the evaluation of element is not empty if output_text.strip() != "": output_text = prefix + output_text + suffix # Add the default value if output_text is empty if output_text == "": output_text = default_value return (output_text, errors) elif format_element is not None and format_element['type'] == "field": # b) We have not found an element in files that has the tag # name. Then look for it in the table "tag" # # # # Load special values given as parameters separator = parameters.get('separator ', "") nbMax = parameters.get('nbMax', "") escape = parameters.get('escape', "1") # By default, escape here # Get the fields tags that have to be printed tags = format_element['attrs']['tags'] output_text = [] # Get values corresponding to tags for tag in tags: p_tag = parse_tag(tag) values = record_get_field_values(bfo.get_record(), p_tag[0], p_tag[1], p_tag[2], p_tag[3]) if len(values)>0 and isinstance(values[0], dict): #flatten dict to its values only values_list = map(lambda x: x.values(), values) #output_text.extend(values) for values in values_list: output_text.extend(values) else: output_text.extend(values) if nbMax != "": try: nbMax = int(nbMax) output_text = output_text[:nbMax] except: name = format_element['attrs']['name'] error = ("ERR_BIBFORMAT_NBMAX_NOT_INT", name) errors.append(error) if verbose < 5: register_errors(error, 'error') elif verbose >= 5: error_string = get_msgs_for_code_list(error, stream='error', ln=cdslang) output_text = output_text.append(error_string[0][1]) # Add prefix and suffix if they have been given as parameters and if # the evaluation of element is not empty. # If evaluation is empty string, return default value if it exists. # Else return empty string if ("".join(output_text)).strip() != "": # If escape is equal to 1, then escape all # HTML reserved chars. 
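            # Illustrative (sample value assumed):
            #   cgi.escape('Dust & <i>obscuration</i>')
            # returns 'Dust &amp; &lt;i&gt;obscuration&lt;/i&gt;'.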
if escape == '1': output_text = cgi.escape(separator.join(output_text)) else: output_text = separator.join(output_text) output_text = prefix + output_text + suffix else: #Return default value output_text = default_value return (output_text, errors) else: # c) Element is unknown error = get_msgs_for_code_list([("ERR_BIBFORMAT_CANNOT_RESOLVE_ELEMENT_NAME", format_element)], stream='error', ln=cdslang) errors.append(error) if verbose < 5: register_errors(error, 'error') return ("", errors) elif verbose >= 5: if verbose >= 9: sys.exit(error[0][1]) return ('' + \ error[0][1]+'', errors) def filter_languages(format_template, ln='en'): """ Filters the language tags that do not correspond to the specified language. @param format_template the format template code @param ln the language that is NOT filtered out from the template @return the format template with unnecessary languages filtered out """ # First define search_lang_tag(match) and clean_language_tag(match), used # in re.sub() function def search_lang_tag(match): """ Searches for the ... tag and remove inner localized tags such as , , that are not current_lang. If current_lang cannot be found inside ... , try to use 'cdslang' @param match a match object corresponding to the special tag that must be interpreted """ current_lang = ln def clean_language_tag(match): """ Return tag text content if tag language of match is output language. Called by substitution in 'filter_languages(...)' @param match a match object corresponding to the special tag that must be interpreted """ if match.group(1) == current_lang: return match.group(2) else: return "" # End of clean_language_tag lang_tag_content = match.group("langs") # Try to find tag with current lang. If it does not exists, # then current_lang becomes cdslang until the end of this # replace pattern_current_lang = re.compile(r"<("+current_lang+ \ r")\s*>(.*?)()", re.IGNORECASE | re.DOTALL) if re.search(pattern_current_lang, lang_tag_content) is None: current_lang = cdslang cleaned_lang_tag = ln_pattern.sub(clean_language_tag, lang_tag_content) return cleaned_lang_tag # End of search_lang_tag filtered_format_template = pattern_lang.sub(search_lang_tag, format_template) return filtered_format_template def get_format_template(filename, with_attributes=False): """ Returns the structured content of the given formate template. if 'with_attributes' is true, returns the name and description. Else 'attrs' is not returned as key in dictionary (it might, if it has already been loaded previously) {'code':"Some template code" 'attrs': {'name': "a name", 'description': "a description"} } @param filename the filename of an format template @param with_attributes if True, fetch the attributes (names and description) for format' @return strucured content of format template """ # Get from cache whenever possible global format_templates_cache if not filename.endswith("."+CFG_BIBFORMAT_FORMAT_TEMPLATE_EXTENSION) and \ not filename.endswith(".xsl"): return None if format_templates_cache.has_key(filename): # If we must return with attributes and template exist in # cache with attributes then return cache. 
# Else reload with attributes if with_attributes and \ format_templates_cache[filename].has_key('attrs'): return format_templates_cache[filename] format_template = {'code':""} try: path = "%s%s%s" % (CFG_BIBFORMAT_TEMPLATES_PATH, os.sep, filename) format_file = open(path) format_content = format_file.read() format_file.close() # Load format template code # Remove name and description if filename.endswith("."+CFG_BIBFORMAT_FORMAT_TEMPLATE_EXTENSION): code_and_description = pattern_format_template_name.sub("", format_content) code = pattern_format_template_desc.sub("", code_and_description) else: code = format_content format_template['code'] = code except Exception, e: errors = get_msgs_for_code_list([("ERR_BIBFORMAT_CANNOT_READ_TEMPLATE_FILE", filename, str(e))], stream='error', ln=cdslang) register_errors(errors, 'error') # Save attributes if necessary if with_attributes: format_template['attrs'] = get_format_template_attrs(filename) # Cache and return format_templates_cache[filename] = format_template return format_template def get_format_templates(with_attributes=False): """ Returns the list of all format templates, as dictionary with filenames as keys if 'with_attributes' is true, returns the name and description. Else 'attrs' is not returned as key in each dictionary (it might, if it has already been loaded previously) [{'code':"Some template code" 'attrs': {'name': "a name", 'description': "a description"} }, ... } @param with_attributes if True, fetch the attributes (names and description) for formats """ format_templates = {} files = os.listdir(CFG_BIBFORMAT_TEMPLATES_PATH) for filename in files: if filename.endswith("."+CFG_BIBFORMAT_FORMAT_TEMPLATE_EXTENSION) or \ filename.endswith(".xsl"): format_templates[filename] = get_format_template(filename, with_attributes) return format_templates def get_format_template_attrs(filename): """ Returns the attributes of the format template with given filename The attributes are {'name', 'description'} Caution: the function does not check that path exists or that the format element is valid. @param the path to a format element """ attrs = {} attrs['name'] = "" attrs['description'] = "" try: template_file = open("%s%s%s" % (CFG_BIBFORMAT_TEMPLATES_PATH, os.sep, filename)) code = template_file.read() template_file.close() match = None if filename.endswith(".xsl"): # .xsl attrs['name'] = filename[:-4] else: # .bft match = pattern_format_template_name.search(code) if match is not None: attrs['name'] = match.group('name') else: attrs['name'] = filename match = pattern_format_template_desc.search(code) if match is not None: attrs['description'] = match.group('desc').rstrip('.') except Exception, e: errors = get_msgs_for_code_list([("ERR_BIBFORMAT_CANNOT_READ_TEMPLATE_FILE", filename, str(e))], stream='error', ln=cdslang) register_errors(errors, 'error') attrs['name'] = filename return attrs def get_format_element(element_name, verbose=0, with_built_in_params=False): """ Returns the format element structured content. Return None if element cannot be loaded (file not found, not readable or invalid) The returned structure is {'attrs': {some attributes in dict. 
See get_format_element_attrs_from_*} 'code': the_function_code, 'type':"field" or "python" depending if element is defined in file or table, 'escape_function': the function to call to know if element output must be escaped} @param element_name the name of the format element to load @param verbose the level of verbosity from 0 to 9 (O: silent, 5: errors, 7: errors and warnings, 9: errors and warnings, stop if error (debug mode )) @param with_built_in_params if True, load the parameters built in all elements @return a dictionary with format element attributes """ # Get from cache whenever possible global format_elements_cache errors = [] # Resolve filename and prepare 'name' as key for the cache filename = resolve_format_element_filename(element_name) if filename is not None: name = filename.upper() else: name = element_name.upper() if format_elements_cache.has_key(name): element = format_elements_cache[name] if not with_built_in_params or \ (with_built_in_params and \ element['attrs'].has_key('builtin_params')): return element if filename is None: # Element is maybe in tag table if bibformat_dblayer.tag_exists_for_name(element_name): format_element = {'attrs': get_format_element_attrs_from_table( \ element_name, with_built_in_params), 'code':None, 'escape_function':None, 'type':"field"} # Cache and returns format_elements_cache[name] = format_element return format_element else: errors = get_msgs_for_code_list([("ERR_BIBFORMAT_FORMAT_ELEMENT_NOT_FOUND", element_name)], stream='error', ln=cdslang) if verbose == 0: register_errors(errors, 'error') elif verbose >= 5: sys.stderr.write(errors[0][1]) return None else: format_element = {} module_name = filename if module_name.endswith(".py"): module_name = module_name[:-3] # Load element try: module = __import__(CFG_BIBFORMAT_ELEMENTS_IMPORT_PATH + \ "." + module_name) # Load last module in import path # For eg. load bfe_name in # invenio.bibformat_elements.bfe_name # Used to keep flexibility regarding where elements # directory is (for eg. 
test cases) components = CFG_BIBFORMAT_ELEMENTS_IMPORT_PATH.split(".") for comp in components[1:]: module = getattr(module, comp) except Exception, e: # We catch all exceptions here, as we just want to print # traceback in all cases tb = sys.exc_info()[2] stack = traceback.format_exception(Exception, e, tb, limit=None) errors = get_msgs_for_code_list([("ERR_BIBFORMAT_IN_FORMAT_ELEMENT", element_name,"\n" + "\n".join(stack[-2:-1]))], stream='error', ln=cdslang) if verbose == 0: register_errors(errors, 'error') elif verbose >= 5: sys.stderr.write(errors[0][1]) if errors: if verbose >= 7: raise Exception, errors[0][1] return None # Load function 'format()' inside element try: function_format = module.__dict__[module_name].format format_element['code'] = function_format except AttributeError, e: errors = get_msgs_for_code_list([("ERR_BIBFORMAT_FORMAT_ELEMENT_FORMAT_FUNCTION", element_name)], stream='warning', ln=cdslang) if verbose == 0: register_errors(errors, 'error') elif verbose >= 5: sys.stderr.write(errors[0][1]) if errors: if verbose >= 7: raise Exception, errors[0][1] return None # Load function 'escape_values()' inside element function_escape = getattr(module.__dict__[module_name], 'escape_values', None) format_element['escape_function'] = function_escape # Prepare, cache and return format_element['attrs'] = get_format_element_attrs_from_function( \ function_format, element_name, with_built_in_params) format_element['type'] = "python" format_elements_cache[name] = format_element return format_element def get_format_elements(with_built_in_params=False): """ Returns the list of format elements attributes as dictionary structure Elements declared in files have priority over element declared in 'tag' table The returned object has this format: {element_name1: {'attrs': {'description':..., 'seealso':... 'params':[{'name':..., 'default':..., 'description':...}, ...] 'builtin_params':[{'name':..., 'default':..., 'description':...}, ...] }, 'code': code_of_the_element }, element_name2: {...}, ...} Returns only elements that could be loaded (not error in code) @return a dict of format elements with name as key, and a dict as attributes @param with_built_in_params if True, load the parameters built in all elements """ format_elements = {} mappings = bibformat_dblayer.get_all_name_tag_mappings() for name in mappings: format_elements[name.upper().replace(" ", "_").strip()] = get_format_element(name, with_built_in_params=with_built_in_params) files = os.listdir(CFG_BIBFORMAT_ELEMENTS_PATH) for filename in files: filename_test = filename.upper().replace(" ", "_") if filename_test.endswith(".PY") and filename.upper() != "__INIT__.PY": if filename_test.startswith("BFE_"): filename_test = filename_test[4:] element_name = filename_test[:-3] element = get_format_element(element_name, with_built_in_params=with_built_in_params) if element is not None: format_elements[element_name] = element return format_elements def get_format_element_attrs_from_function(function, element_name, with_built_in_params=False): """ Returns the attributes of the function given as parameter. It looks for standard parameters of the function, default values and comments in the docstring. The attributes are {'description', 'seealso':['element.py', ...], 'params':{name:{'name', 'default', 'description'}, ...], name2:{}} The attributes are {'name' : "name of element" #basically the name of 'name' parameter 'description': "a string description of the element", 'seealso' : ["element_1.py", "element_2.py", ...] 
#a list of related elements 'params': [{'name':"param_name", #a list of parameters for this element (except 'bfo') 'default':"default value", 'description': "a description"}, ...], 'builtin_params': {name: {'name':"param_name",#the parameters builtin for all elem of this kind 'default':"default value", 'description': "a description"}, ...}, } @param function the formatting function of a format element @param element_name the name of the element @param with_built_in_params if True, load the parameters built in all elements """ attrs = {} attrs['description'] = "" attrs['name'] = element_name.replace(" ", "_").upper() attrs['seealso'] = [] docstring = function.__doc__ if isinstance(docstring, str): # Look for function description in docstring #match = pattern_format_element_desc.search(docstring) description = docstring.split("@param")[0] description = description.split("@see")[0] attrs['description'] = description.strip().rstrip('.') # Look for @see in docstring match = pattern_format_element_seealso.search(docstring) if match is not None: elements = match.group('see').rstrip('.').split(",") for element in elements: attrs['seealso'].append(element.strip()) params = {} # Look for parameters in function definition (args, varargs, varkw, defaults) = inspect.getargspec(function) # Prepare args and defaults_list such that we can have a mapping # from args to defaults args.reverse() if defaults is not None: defaults_list = list(defaults) defaults_list.reverse() else: defaults_list = [] for arg, default in map(None, args, defaults_list): if arg == "bfo": #Don't keep this as parameter. It is hidden to users, and #exists in all elements of this kind continue param = {} param['name'] = arg if default is None: #In case no check is made inside element, we prefer to #print "" (nothing) than None in output param['default'] = "" else: param['default'] = default param['description'] = "(no description provided)" params[arg] = param if isinstance(docstring, str): # Look for @param descriptions in docstring. 
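            # Illustrative docstring line matched by
            # pattern_format_element_params (wording assumed):
            #   "@param separator a string printed between the values"
            # which yields name='separator' and the rest of the line as its
            # description.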
# Add description to existing parameters in params dict params_iterator = pattern_format_element_params.finditer(docstring) for match in params_iterator: name = match.group('name') if params.has_key(name): params[name]['description'] = match.group('desc').rstrip('.') attrs['params'] = params.values() # Load built-in parameters if necessary if with_built_in_params: builtin_params = [] # Add 'prefix' parameter param_prefix = {} param_prefix['name'] = "prefix" param_prefix['default'] = "" param_prefix['description'] = """A prefix printed only if the record has a value for this element""" builtin_params.append(param_prefix) # Add 'suffix' parameter param_suffix = {} param_suffix['name'] = "suffix" param_suffix['default'] = "" param_suffix['description'] = """A suffix printed only if the record has a value for this element""" builtin_params.append(param_suffix) # Add 'default' parameter param_default = {} param_default['name'] = "default" param_default['default'] = "" param_default['description'] = """A default value printed if the record has no value for this element""" builtin_params.append(param_default) # Add 'escape' parameter param_escape = {} param_escape['name'] = "escape" param_escape['default'] = "" param_escape['description'] = """If set to 1, replaces special characters '&', '<' and '>' of this element by SGML entities""" builtin_params.append(param_escape) attrs['builtin_params'] = builtin_params return attrs def get_format_element_attrs_from_table(element_name, with_built_in_params=False): """ Returns the attributes of the format element with given name in 'tag' table. Returns None if element_name does not exist in tag table. The attributes are {'name' : "name of element" #basically the name of 'element_name' parameter 'description': "a string description of the element", 'seealso' : [] #a list of related elements. Always empty in this case 'params': [], #a list of parameters for this element. 
Always empty in this case 'builtin_params': [{'name':"param_name", #the parameters builtin for all elem of this kind 'default':"default value", 'description': "a description"}, ...], 'tags':["950.1", 203.a] #the list of tags printed by this element } @param element_name an element name in database @param element_name the name of the element @param with_built_in_params if True, load the parameters built in all elements """ attrs = {} tags = bibformat_dblayer.get_tags_from_name(element_name) field_label = "field" if len(tags)>1: field_label = "fields" attrs['description'] = "Prints %s %s of the record" % (field_label, ", ".join(tags)) attrs['name'] = element_name.replace(" ", "_").upper() attrs['seealso'] = [] attrs['params'] = [] attrs['tags'] = tags # Load built-in parameters if necessary if with_built_in_params: builtin_params = [] # Add 'prefix' parameter param_prefix = {} param_prefix['name'] = "prefix" param_prefix['default'] = "" param_prefix['description'] = """A prefix printed only if the record has a value for this element""" builtin_params.append(param_prefix) # Add 'suffix' parameter param_suffix = {} param_suffix['name'] = "suffix" param_suffix['default'] = "" param_suffix['description'] = """A suffix printed only if the record has a value for this element""" builtin_params.append(param_suffix) # Add 'separator' parameter param_separator = {} param_separator['name'] = "separator" param_separator['default'] = " " param_separator['description'] = """A separator between elements of the field""" builtin_params.append(param_separator) # Add 'nbMax' parameter param_nbMax = {} param_nbMax['name'] = "nbMax" param_nbMax['default'] = "" param_nbMax['description'] = """The maximum number of values to print for this element. No limit if not specified""" builtin_params.append(param_nbMax) # Add 'default' parameter param_default = {} param_default['name'] = "default" param_default['default'] = "" param_default['description'] = """A default value printed if the record has no value for this element""" builtin_params.append(param_default) # Add 'escape' parameter param_escape = {} param_escape['name'] = "escape" param_escape['default'] = "" param_escape['description'] = """If set to 1, replaces special characters '&', '<' and '>' of this element by SGML entities""" builtin_params.append(param_escape) attrs['builtin_params'] = builtin_params return attrs def get_output_format(code, with_attributes=False, verbose=0): """ Returns the structured content of the given output format If 'with_attributes' is true, also returns the names and description of the output formats, else 'attrs' is not returned in dict (it might, if it has already been loaded previously). if output format corresponding to 'code' is not found return an empty structure. 
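    An output format file on disk typically looks like this (content assumed
    for illustration; see the parsing loop below):
        tag 980__a:
        PREPRINT --- PREPRINT.bft
        THESIS --- THESIS.bft
        default: HD_DEFAULT.bft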
See get_output_format_attrs() to learn more on the attributes {'rules': [ {'field': "980__a", 'value': "PREPRINT", 'template': "filename_a.bft", }, {...} ], 'attrs': {'names': {'generic':"a name", 'sn':{'en': "a name", 'fr':"un nom"}, 'ln':{'en':"a long name"}} 'description': "a description" 'code': "fnm1", 'content_type': "application/ms-excel", 'visibility': 1 } 'default':"filename_b.bft" } @param code the code of an output_format @param with_attributes if True, fetch the attributes (names and description) for format @param verbose the level of verbosity from 0 to 9 (O: silent, 5: errors, 7: errors and warnings, 9: errors and warnings, stop if error (debug mode )) @return strucured content of output format """ output_format = {'rules':[], 'default':""} filename = resolve_output_format_filename(code, verbose) if filename is None: errors = get_msgs_for_code_list([("ERR_BIBFORMAT_OUTPUT_FORMAT_CODE_UNKNOWN", code)], stream='error', ln=cdslang) register_errors(errors, 'error') if with_attributes: #Create empty attrs if asked for attributes output_format['attrs'] = get_output_format_attrs(code, verbose) return output_format # Get from cache whenever possible global format_outputs_cache if format_outputs_cache.has_key(filename): # If was must return with attributes but cache has not # attributes, then load attributes if with_attributes and not \ format_outputs_cache[filename].has_key('attrs'): format_outputs_cache[filename]['attrs'] = get_output_format_attrs(code, verbose) return format_outputs_cache[filename] try: if with_attributes: output_format['attrs'] = get_output_format_attrs(code, verbose) path = "%s%s%s" % (CFG_BIBFORMAT_OUTPUTS_PATH, os.sep, filename ) format_file = open(path) current_tag = '' for line in format_file: line = line.strip() if line == "": # Ignore blank lines continue if line.endswith(":"): # Retrieve tag # Remove : spaces and eol at the end of line clean_line = line.rstrip(": \n\r") # The tag starts at second position current_tag = "".join(clean_line.split()[1:]).strip() elif line.find('---') != -1: words = line.split('---') template = words[-1].strip() condition = ''.join(words[:-1]) value = "" output_format['rules'].append({'field': current_tag, 'value': condition, 'template': template, }) elif line.find(':') != -1: # Default case default = line.split(':')[1].strip() output_format['default'] = default except Exception, e: errors = get_msgs_for_code_list([("ERR_BIBFORMAT_CANNOT_READ_OUTPUT_FILE", filename, str(e))], stream='error', ln=cdslang) register_errors(errors, 'error') # Cache and return format_outputs_cache[filename] = output_format return output_format def get_output_format_attrs(code, verbose=0): """ Returns the attributes of an output format. The attributes contain 'code', which is the short identifier of the output format (to be given as parameter in format_record function to specify the output format), 'description', a description of the output format, 'visibility' the visibility of the format in the output format list on public pages and 'names', the localized names of the output format. If 'content_type' is specified then the search_engine will send a file with this content type and with result of formatting as content to the user. The 'names' dict always contais 'generic', 'ln' (for long name) and 'sn' (for short names) keys. 'generic' is the default name for output format. 'ln' and 'sn' contain long and short localized names of the output format. Only the languages for which a localization exist are used. 
{'names': {'generic':"a name", 'sn':{'en': "a name", 'fr':"un nom"}, 'ln':{'en':"a long name"}} 'description': "a description" 'code': "fnm1", 'content_type': "application/ms-excel", 'visibility': 1 } @param code the short identifier of the format @param verbose the level of verbosity from 0 to 9 (O: silent, 5: errors, 7: errors and warnings, 9: errors and warnings, stop if error (debug mode )) @return strucured content of output format attributes """ if code.endswith("."+CFG_BIBFORMAT_FORMAT_OUTPUT_EXTENSION): code = code[:-(len(CFG_BIBFORMAT_FORMAT_OUTPUT_EXTENSION) + 1)] attrs = {'names':{'generic':"", 'ln':{}, 'sn':{}}, 'description':'', 'code':code.upper(), 'content_type':"", 'visibility':1} filename = resolve_output_format_filename(code, verbose) if filename is None: return attrs attrs['names'] = bibformat_dblayer.get_output_format_names(code) attrs['description'] = bibformat_dblayer.get_output_format_description(code) attrs['content_type'] = bibformat_dblayer.get_output_format_content_type(code) attrs['visibility'] = bibformat_dblayer.get_output_format_visibility(code) return attrs def get_output_formats(with_attributes=False): """ Returns the list of all output format, as a dictionary with their filename as key If 'with_attributes' is true, also returns the names and description of the output formats, else 'attrs' is not returned in dicts (it might, if it has already been loaded previously). See get_output_format_attrs() to learn more on the attributes {'filename_1.bfo': {'rules': [ {'field': "980__a", 'value': "PREPRINT", 'template': "filename_a.bft", }, {...} ], 'attrs': {'names': {'generic':"a name", 'sn':{'en': "a name", 'fr':"un nom"}, 'ln':{'en':"a long name"}} 'description': "a description" 'code': "fnm1" } 'default':"filename_b.bft" }, 'filename_2.bfo': {...}, ... } @return the list of output formats """ output_formats = {} files = os.listdir(CFG_BIBFORMAT_OUTPUTS_PATH) for filename in files: if filename.endswith("."+CFG_BIBFORMAT_FORMAT_OUTPUT_EXTENSION): code = "".join(filename.split(".")[:-1]) output_formats[filename] = get_output_format(code, with_attributes) return output_formats def get_kb_mapping(kb, string, default=""): """ Returns the value of the string' in the knowledge base 'kb'. If kb does not exist or string does not exist in kb, returns 'default' string value. @param kb a knowledge base name @param string a key in a knowledge base @param default a default value if 'string' is not in 'kb' @return the value corresponding to the given string in given kb """ global kb_mappings_cache if kb_mappings_cache.has_key(kb): kb_cache = kb_mappings_cache[kb] if kb_cache.has_key(string): value = kb_mappings_cache[kb][string] if value is None: return default else: return value else: # Precreate for caching this kb kb_mappings_cache[kb] = {} value = bibformat_dblayer.get_kb_mapping_value(kb, string) kb_mappings_cache[kb][str(string)] = value if value is None: return default else: return value def resolve_format_element_filename(string): """ Returns the filename of element corresponding to string This is necessary since format templates code call elements by ignoring case, for eg. is the same as . It is also recommended that format elements filenames are prefixed with bfe_ . We need to look for these too. The name of the element has to start with "BFE_". 
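# --- Illustrative usage sketch (not part of the original patch) --------------
# Looking up a value in a knowledge base with get_kb_mapping(); the KB name and
# key below are invented for the example.  Unknown keys fall back to the
# supplied default, and lookups are cached in kb_mappings_cache.
journal = get_kb_mapping('EJOURNALS', 'Phys. Rev. D', default='Phys. Rev. D')
# ------------------------------------------------------------------------------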
@param name a name for a format element @return the corresponding filename, with right case """ if not string.endswith(".py"): name = string.replace(" ", "_").upper() +".PY" else: name = string.replace(" ", "_").upper() files = os.listdir(CFG_BIBFORMAT_ELEMENTS_PATH) for filename in files: test_filename = filename.replace(" ", "_").upper() if test_filename == name or \ test_filename == "BFE_" + name or \ "BFE_" + test_filename == name: return filename # No element with that name found # Do not log error, as it might be a normal execution case: # element can be in database return None def resolve_output_format_filename(code, verbose=0): """ Returns the filename of output corresponding to code This is necessary since output formats names are not case sensitive but most file systems are. @param code the code for an output format @param verbose the level of verbosity from 0 to 9 (O: silent, 5: errors, 7: errors and warnings, 9: errors and warnings, stop if error (debug mode )) @return the corresponding filename, with right case, or None if not found """ #Remove non alphanumeric chars (except .) code = re.sub(r"[^.0-9a-zA-Z]", "", code) if not code.endswith("."+CFG_BIBFORMAT_FORMAT_OUTPUT_EXTENSION): code = re.sub(r"\W", "", code) code += "."+CFG_BIBFORMAT_FORMAT_OUTPUT_EXTENSION files = os.listdir(CFG_BIBFORMAT_OUTPUTS_PATH) for filename in files: if filename.upper() == code.upper(): return filename # No output format with that name found errors = get_msgs_for_code_list([("ERR_BIBFORMAT_CANNOT_RESOLVE_OUTPUT_NAME", code)], stream='error', ln=cdslang) if verbose == 0: register_errors(errors, 'error') elif verbose >= 5: sys.stderr.write(errors[0][1]) if verbose >= 9: sys.exit(errors[0][1]) return None def get_fresh_format_template_filename(name): """ Returns a new filename and name for template with given name. Used when writing a new template to a file, so that the name has no space, is unique in template directory Returns (unique_filename, modified_name) @param a name for a format template @return the corresponding filename, and modified name if necessary """ #name = re.sub(r"\W", "", name) #Remove non alphanumeric chars name = name.replace(" ", "_") filename = name # Remove non alphanumeric chars (except .) filename = re.sub(r"[^.0-9a-zA-Z]", "", filename) path = CFG_BIBFORMAT_TEMPLATES_PATH + os.sep + filename \ + "." + CFG_BIBFORMAT_FORMAT_TEMPLATE_EXTENSION index = 1 while os.path.exists(path): index += 1 filename = name + str(index) path = CFG_BIBFORMAT_TEMPLATES_PATH + os.sep + filename \ + "." + CFG_BIBFORMAT_FORMAT_TEMPLATE_EXTENSION if index > 1: returned_name = (name + str(index)).replace("_", " ") else: returned_name = name.replace("_", " ") return (filename + "." + CFG_BIBFORMAT_FORMAT_TEMPLATE_EXTENSION, returned_name) #filename.replace("_", " ")) def get_fresh_output_format_filename(code): """ Returns a new filename for output format with given code. Used when writing a new output format to a file, so that the code has no space, is unique in output format directory. The filename also need to be at most 6 chars long, as the convention is that filename == output format code (+ .extension) We return an uppercase code Returns (unique_filename, modified_code) @param code the code of an output format @return the corresponding filename, and modified code if necessary """ #code = re.sub(r"\W", "", code) #Remove non alphanumeric chars code = code.upper().replace(" ", "_") # Remove non alphanumeric chars (except .) 
code = re.sub(r"[^.0-9a-zA-Z]", "", code) if len(code) > 6: code = code[:6] filename = code path = CFG_BIBFORMAT_OUTPUTS_PATH + os.sep + filename \ + "." + CFG_BIBFORMAT_FORMAT_OUTPUT_EXTENSION index = 2 while os.path.exists(path): filename = code + str(index) if len(filename) > 6: filename = code[:-(len(str(index)))]+str(index) index += 1 path = CFG_BIBFORMAT_OUTPUTS_PATH + os.sep + filename \ + "." + CFG_BIBFORMAT_FORMAT_OUTPUT_EXTENSION # We should not try more than 99999... Well I don't see how we # could get there.. Sanity check. if index >= 99999: errors = get_msgs_for_code_list([("ERR_BIBFORMAT_NB_OUTPUTS_LIMIT_REACHED", code)], stream='error', ln=cdslang) register_errors(errors, 'error') sys.exit("Output format cannot be named as %s"%code) return (filename + "." + CFG_BIBFORMAT_FORMAT_OUTPUT_EXTENSION, filename) def clear_caches(): """ Clear the caches (Output Format, Format Templates and Format Elements) """ global format_templates_cache, format_elements_cache , \ format_outputs_cache, kb_mappings_cache format_templates_cache = {} format_elements_cache = {} format_outputs_cache = {} kb_mappings_cache = {} class BibFormatObject: """ An object that encapsulates a record and associated methods, and that is given as parameter to all format elements 'format' function. The object is made specifically for a given formatting, i.e. it includes for example the language for the formatting. The object provides basic accessors to the record. For full access, one can get the record with get_record() and then use BibRecord methods on the returned object. """ # The record record = None # The language in which the formatting has to be done lang = cdslang # A list of string describing the context in which the record has # to be formatted. # It represents the words of the user request in web interface search search_pattern = [] # The id of the record recID = 0 uid = None # DEPRECATED: use bfo.user_info['uid'] instead # The information about the user, as returned by # 'webuser.collect_user_info(req)' user_info = None # The format in which the record is being formatted format = '' req = None # DEPRECATED: use bfo.user_info instead def __init__(self, recID, ln=cdslang, search_pattern=[], xml_record=None, user_info=None, format=''): """ Creates a new bibformat object, with given record. You can either specify an record ID to format, or give its xml representation. if 'xml_record' is not None, use 'xml_record' instead of recID for the record. 'user_info' allows to grant access to some functionalities on a page depending on the user's priviledges. 
It is a dictionary in the following form: user_info = { 'remote_ip' : '', 'remote_host' : '', 'referer' : '', 'uri' : '', 'agent' : '', 'apache_user' : '', 'apache_group' : [], 'uid' : -1, 'nickname' : '', 'email' : '', 'group' : [], 'guest' : '1' } @param recID the id of a record @param ln the language in which the record has to be formatted @param search_pattern list of string representing the request used by the user in web interface @param xml_record a xml string of the record to format @param user_info the information of the user who will view the formatted page @param format the format used for formatting this record """ if xml_record is not None: # If record is given as parameter self.record = create_record(xml_record)[0] recID = record_get_field_value(self.record, "001") self.lang = wash_language(ln) self.search_pattern = search_pattern self.recID = recID self.format = format self.user_info = user_info if self.user_info is None: self.user_info = collect_user_info(None) def get_record(self): """ Returns the record of this BibFormatObject instance @return the record structure as returned by BibRecord """ # Create record if necessary if self.record is None: # on-the-fly creation if current output is xm record = create_record(record_get_xml(self.recID, 'xm', on_the_fly=(self.format.lower() == 'xm'))) self.record = record[0] return self.record def control_field(self, tag, escape=0): """ Returns the value of control field given by tag in record @param tag the marc code of a field @param escape 1 if returned value should be escaped. Else 0. @return value of field tag in record """ if self.get_record() is None: #Case where BibRecord could not parse object return '' p_tag = parse_tag(tag) field_value = record_get_field_value(self.get_record(), p_tag[0], p_tag[1], p_tag[2], p_tag[3]) if escape == 0: return field_value else: return escape_field(field_value, escape) def field(self, tag, escape=0): """ Returns the value of the field corresponding to tag in the current record. If the value does not exist, return empty string 'escape' parameter allows to escape special characters of the field. The value of escape can be: 0 - no escaping 1 - escape all HTML characters 2 - escape all HTML characters by default. If field starts with , escape only unsafe characters, but leave basic HTML tags. @param tag the marc code of a field @param escape 1 if returned value should be escaped. Else 0. (see above for other modes) @return value of field tag in record """ list_of_fields = self.fields(tag) if len(list_of_fields) > 0: # Escaping below if escape == 0: return list_of_fields[0] else: return escape_field(list_of_fields[0], escape) else: return "" def fields(self, tag, escape=0, repeatable_subfields_p=False): """ Returns the list of values corresonding to "tag". If tag has an undefined subcode (such as 999C5), the function returns a list of dictionaries, whoose keys are the subcodes and the values are the values of tag.subcode. If the tag has a subcode, simply returns list of values corresponding to tag. Eg. for given MARC: 999C5 $a value_1a $b value_1b 999C5 $b value_2b 999C5 $b value_3b $b value_3b_bis >> bfo.fields('999C5b') >> ['value_1b', 'value_2b', 'value_3b', 'value_3b_bis'] >> bfo.fields('999C5') >> [{'a':'value_1a', 'b':'value_1b'}, {'b':'value_2b'}, {'b':'value_3b'}] By default the function returns only one value for each subfield (that is it considers that repeatable subfields are not allowed). It is why in the above example 'value3b_bis' is not shown for bfo.fields('999C5'). 
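# --- Illustrative usage sketch (not part of the original patch) --------------
# Creating a BibFormatObject and reading record values from it, as a format
# element would; the record ID and MARC tags are examples only (see the test
# suite for real fixtures).
bfo = BibFormatObject(1, ln='en', search_pattern=[], format='hd')
title = bfo.field('245__a', escape=1)   # single value, HTML-escaped
authors = bfo.fields('700__a')          # list of subfield values
references = bfo.fields('999C5')        # list of dicts keyed by subfield code
# ------------------------------------------------------------------------------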
(Note that it is not defined which of value_3b or value_3b_bis is returned). This is to simplify the use of the function, as most of the time subfields are not repeatable (in that way we get a string instead of a list). You can allow repeatable subfields by setting 'repeatable_subfields_p' parameter to True. In this mode, the above example would return: >> bfo.fields('999C5b', repeatable_subfields_p=True) >> ['value_1b', 'value_2b', 'value_3b'] >> bfo.fields('999C5', repeatable_subfields_p=True) >> [{'a':['value_1a'], 'b':['value_1b']}, {'b':['value_2b']}, {'b':['value_3b', 'value3b_bis']}] NOTICE THAT THE RETURNED STRUCTURE IS DIFFERENT. Also note that whatever the value of 'repeatable_subfields_p' is, bfo.fields('999C5b') always show all fields, even repeatable ones. This is because the parameter has no impact on the returned structure (it is always a list). 'escape' parameter allows to escape special characters of the fields. The value of escape can be: 0 - no escaping 1 - escape all HTML characters 2 - escape all dangerous HTML tags. 3 - Mix of mode 1 and 2. If value of field starts with , then use mode 2. Else use mode 1. 4 - Remove all HTML tags @param tag the marc code of a field @param escape 1 if returned values should be escaped. Else 0. @repeatable_subfields_p if True, returns the list of subfields in the dictionary @return values of field tag in record """ if self.get_record() is None: # Case where BibRecord could not parse object return [] p_tag = parse_tag(tag) if p_tag[3] != "": # Subcode has been defined. Simply returns list of values values = record_get_field_values(self.get_record(), p_tag[0], p_tag[1], p_tag[2], p_tag[3]) if escape == 0: return values else: return [escape_field(value, escape) for value in values] else: # Subcode is undefined. Returns list of dicts. # However it might be the case of a control field. instances = record_get_field_instances(self.get_record(), p_tag[0], p_tag[1], p_tag[2]) if repeatable_subfields_p: list_of_instances = [] for instance in instances: instance_dict = {} for subfield in instance[0]: if not instance_dict.has_key(subfield[0]): instance_dict[subfield[0]] = [] if escape == 0: instance_dict[subfield[0]].append(subfield[1]) else: instance_dict[subfield[0]].append(escape_field(subfield[1], escape)) list_of_instances.append(instance_dict) return list_of_instances else: if escape == 0: return [dict(instance[0]) for instance in instances] else: return [dict([ (subfield[0], escape_field(subfield[1], escape)) \ for subfield in instance[0] ]) \ for instance in instances] def kb(self, kb, string, default=""): """ Returns the value of the "string" in the knowledge base "kb". If kb does not exist or string does not exist in kb, returns 'default' string or empty string if not specified. @param kb a knowledge base name @param string the string we want to translate @param default a default value returned if 'string' not found in 'kb' """ if string is None: return default val = get_kb_mapping(kb, string, default) if val is None: return default else: return val def escape_field(value, mode=0): """ Utility function used to escape the value of a field in given mode. - mode 0: no escaping - mode 1: escaping all HTML/XML characters (escaped chars are shown as escaped) - mode 2: escaping dangerous HTML tags to avoid XSS, but keep basic one (such as
) Escaped characters are removed. - mode 3: mix of mode 1 and mode 2. If field_value starts with , then use mode 2. Else use mode 1. - mode 4: escaping all HTML/XML tags (escaped tags are removed) - """ if mode == 1: return cgi.escape(value) elif mode == 2: return washer.wash(value, allowed_attribute_whitelist=['href', 'name', 'class'] ) elif mode == 3: if value.lstrip(' \n').startswith(html_field): return washer.wash(value, allowed_attribute_whitelist=['href', 'name', 'class'] ) else: return cgi.escape(value) elif mode == 4: return washer.wash(value, allowed_attribute_whitelist=[], allowed_tag_whitelist=[] ) else: return value def bf_profile(): """ Runs a benchmark """ for i in range(1, 51): format_record(i, "HD", ln=cdslang, verbose=9, search_pattern=[]) return if __name__ == "__main__": import profile import pstats #bf_profile() profile.run('bf_profile()', "bibformat_profile") p = pstats.Stats("bibformat_profile") p.strip_dirs().sort_stats("cumulative").print_stats() diff --git a/modules/bibformat/lib/bibformat_engine_tests.py b/modules/bibformat/lib/bibformat_engine_tests.py index 57cfc0447..a2d921f37 100644 --- a/modules/bibformat/lib/bibformat_engine_tests.py +++ b/modules/bibformat/lib/bibformat_engine_tests.py @@ -1,695 +1,695 @@ # -*- coding: utf-8 -*- ## ## $Id$ ## ## This file is part of CDS Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 CERN. ## ## CDS Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## CDS Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with CDS Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. """Test cases for the BibFormat engine. 
Also test some utilities function in bibformat_utils module""" __revision__ = "$Id$" # pylint: disable-msg=C0301 import unittest import os import sys from invenio import bibformat_engine from invenio import bibformat_utils from invenio import bibformat_config from invenio import bibformatadminlib from invenio import bibrecord -from invenio.config import tmpdir +from invenio.config import CFG_TMPDIR #CFG_BIBFORMAT_OUTPUTS_PATH = "..%setc%soutput_formats" % (os.sep, os.sep) #CFG_BIBFORMAT_TEMPLATES_PATH = "..%setc%sformat_templates" % (os.sep, os.sep) #CFG_BIBFORMAT_ELEMENTS_PATH = "elements" -CFG_BIBFORMAT_OUTPUTS_PATH = "%s" % (tmpdir) -CFG_BIBFORMAT_TEMPLATES_PATH = "%s" % (tmpdir) -CFG_BIBFORMAT_ELEMENTS_PATH = "%s%stests_bibformat_elements" % (tmpdir, os.sep) +CFG_BIBFORMAT_OUTPUTS_PATH = "%s" % (CFG_TMPDIR) +CFG_BIBFORMAT_TEMPLATES_PATH = "%s" % (CFG_TMPDIR) +CFG_BIBFORMAT_ELEMENTS_PATH = "%s%stests_bibformat_elements" % (CFG_TMPDIR, os.sep) CFG_BIBFORMAT_ELEMENTS_IMPORT_PATH = "tests_bibformat_elements" class FormatTemplateTest(unittest.TestCase): """ bibformat - tests on format templates""" def test_get_format_template(self): """bibformat - format template parsing and returned structure""" bibformat_engine.CFG_BIBFORMAT_TEMPLATES_PATH = CFG_BIBFORMAT_TEMPLATES_PATH #Test correct parsing and structure template_1 = bibformat_engine.get_format_template("Test1.bft", with_attributes=True) self.assert_(template_1 is not None) self.assertEqual(template_1['code'], "test") self.assertEqual(template_1['attrs']['name'], "name_test") self.assertEqual(template_1['attrs']['description'], "desc_test") #Test correct parsing and structure of file without description or name template_2 = bibformat_engine.get_format_template("Test_2.bft", with_attributes=True) self.assert_(template_2 is not None) self.assertEqual(template_2['code'], "test") self.assertEqual(template_2['attrs']['name'], "Test_2.bft") self.assertEqual(template_2['attrs']['description'], "") #Test correct parsing and structure of file without description or name unknown_template = bibformat_engine.get_format_template("test_no_template.test", with_attributes=True) self.assertEqual(unknown_template, None) def test_get_format_templates(self): """ bibformat - loading multiple format templates""" bibformat_engine.CFG_BIBFORMAT_TEMPLATES_PATH = CFG_BIBFORMAT_TEMPLATES_PATH templates = bibformat_engine.get_format_templates(with_attributes=True) #test correct loading self.assert_("Test1.bft" in templates.keys()) self.assert_("Test_2.bft" in templates.keys()) self.assert_("Test3.bft" in templates.keys()) self.assert_("Test_no_template.test" not in templates.keys()) #Test correct pasrsing and structure self.assertEqual(templates['Test1.bft']['code'], "test") self.assertEqual(templates['Test1.bft']['attrs']['name'], "name_test") self.assertEqual(templates['Test1.bft']['attrs']['description'], "desc_test") def test_get_format_template_attrs(self): """ bibformat - correct parsing of attributes in format template""" bibformat_engine.CFG_BIBFORMAT_TEMPLATES_PATH = CFG_BIBFORMAT_TEMPLATES_PATH attrs = bibformat_engine.get_format_template_attrs("Test1.bft") self.assertEqual(attrs['name'], "name_test") self.assertEqual(attrs['description'], "desc_test") def test_get_fresh_format_template_filename(self): """ bibformat - getting fresh filename for format template""" bibformat_engine.CFG_BIBFORMAT_TEMPLATES_PATH = CFG_BIBFORMAT_TEMPLATES_PATH filename_and_name_1 = bibformat_engine.get_fresh_format_template_filename("Test") 
self.assert_(len(filename_and_name_1) >= 2) self.assertEqual(filename_and_name_1[0], "Test.bft") filename_and_name_2 = bibformat_engine.get_fresh_format_template_filename("Test1") self.assert_(len(filename_and_name_2) >= 2) self.assert_(filename_and_name_2[0] != "Test1.bft") path = bibformat_engine.CFG_BIBFORMAT_TEMPLATES_PATH + os.sep + filename_and_name_2[0] self.assert_(not os.path.exists(path)) class FormatElementTest(unittest.TestCase): """ bibformat - tests on format templates""" def setUp(self): # pylint: disable-msg=C0103 """bibformat - setting python path to test elements""" - sys.path.append('%s' % tmpdir) + sys.path.append('%s' % CFG_TMPDIR) def test_resolve_format_element_filename(self): """bibformat - resolving format elements filename """ bibformat_engine.CFG_BIBFORMAT_ELEMENTS_PATH = CFG_BIBFORMAT_ELEMENTS_PATH #Test elements filename starting without bfe_, with underscore instead of space filenames = ["test 1", "test 1.py", "bfe_test 1", "bfe_test 1.py", "BFE_test 1", "BFE_TEST 1", "BFE_TEST 1.py", "BFE_TeST 1.py", "BFE_TeST 1", "BfE_TeST 1.py", "BfE_TeST 1","test_1", "test_1.py", "bfe_test_1", "bfe_test_1.py", "BFE_test_1", "BFE_TEST_1", "BFE_TEST_1.py", "BFE_Test_1.py", "BFE_TeST_1", "BfE_TeST_1.py", "BfE_TeST_1"] for i in range(len(filenames)-2): filename_1 = bibformat_engine.resolve_format_element_filename(filenames[i]) self.assert_(filename_1 is not None) filename_2 = bibformat_engine.resolve_format_element_filename(filenames[i+1]) self.assertEqual(filename_1, filename_2) #Test elements filename starting with bfe_, and with underscores instead of spaces filenames = ["test 2", "test 2.py", "bfe_test 2", "bfe_test 2.py", "BFE_test 2", "BFE_TEST 2", "BFE_TEST 2.py", "BFE_TeST 2.py", "BFE_TeST 2", "BfE_TeST 2.py", "BfE_TeST 2","test_2", "test_2.py", "bfe_test_2", "bfe_test_2.py", "BFE_test_2", "BFE_TEST_2", "BFE_TEST_2.py", "BFE_TeST_2.py", "BFE_TeST_2", "BfE_TeST_2.py", "BfE_TeST_2"] for i in range(len(filenames)-2): filename_1 = bibformat_engine.resolve_format_element_filename(filenames[i]) self.assert_(filename_1 is not None) filename_2 = bibformat_engine.resolve_format_element_filename(filenames[i+1]) self.assertEqual(filename_1, filename_2) #Test non existing element non_existing_element = bibformat_engine.resolve_format_element_filename("BFE_NON_EXISTING_ELEMENT") self.assertEqual(non_existing_element, None) def test_get_format_element(self): """bibformat - format elements parsing and returned structure""" bibformat_engine.CFG_BIBFORMAT_ELEMENTS_PATH = CFG_BIBFORMAT_ELEMENTS_PATH bibformat_engine.CFG_BIBFORMAT_ELEMENTS_IMPORT_PATH = CFG_BIBFORMAT_ELEMENTS_IMPORT_PATH #Test loading with different kind of names, for element with spaces in name, without bfe_ element_1 = bibformat_engine.get_format_element("test 1", with_built_in_params=True) self.assert_(element_1 is not None) element_1_bis = bibformat_engine.get_format_element("bfe_tEst_1.py", with_built_in_params=True) self.assertEqual(element_1, element_1_bis) #Test loading with different kind of names, for element without spaces in name, wit bfe_ element_2 = bibformat_engine.get_format_element("test 2", with_built_in_params=True) self.assert_(element_2 is not None) element_2_bis = bibformat_engine.get_format_element("bfe_tEst_2.py", with_built_in_params=True) self.assertEqual(element_2, element_2_bis) #Test loading incorrect elements element_3 = bibformat_engine.get_format_element("test 3", with_built_in_params=True) self.assertEqual(element_3, None) element_4 = bibformat_engine.get_format_element("test 4", 
with_built_in_params=True) self.assertEqual(element_4, None) unknown_element = bibformat_engine.get_format_element("TEST_NO_ELEMENT", with_built_in_params=True) self.assertEqual(unknown_element, None) #Test element without docstring element_5 = bibformat_engine.get_format_element("test_5", with_built_in_params=True) self.assert_(element_5 is not None) self.assertEqual(element_5['attrs']['description'], '') self.assert_({'name':"param1", 'description':"(no description provided)", 'default':""} in element_5['attrs']['params'] ) self.assertEqual(element_5['attrs']['seealso'], []) #Test correct parsing: #Test type of element self.assertEqual(element_1['type'], "python") #Test name = element filename, with underscore instead of spaces, #without BFE_ and uppercase self.assertEqual(element_1['attrs']['name'], "TEST_1") #Test description parsing self.assertEqual(element_1['attrs']['description'], "Prints test") #Test @see parsing self.assertEqual(element_1['attrs']['seealso'], ["element2.py", "unknown_element.py"]) #Test @param parsing self.assert_({'name':"param1", 'description':"desc 1", 'default':""} in element_1['attrs']['params'] ) self.assert_({'name':"param2", 'description':"desc 2", 'default':"default value"} in element_1['attrs']['params'] ) #Test non existing element non_existing_element = bibformat_engine.get_format_element("BFE_NON_EXISTING_ELEMENT") self.assertEqual(non_existing_element, None) def test_get_format_element_attrs_from_function(self): """ bibformat - correct parsing of attributes in 'format' docstring""" bibformat_engine.CFG_BIBFORMAT_ELEMENTS_PATH = CFG_BIBFORMAT_ELEMENTS_PATH bibformat_engine.CFG_BIBFORMAT_ELEMENTS_IMPORT_PATH = CFG_BIBFORMAT_ELEMENTS_IMPORT_PATH element_1 = bibformat_engine.get_format_element("test 1", with_built_in_params=True) function = element_1['code'] attrs = bibformat_engine.get_format_element_attrs_from_function(function, element_1['attrs']['name'], with_built_in_params=True) self.assertEqual(attrs['name'], "TEST_1") #Test description parsing self.assertEqual(attrs['description'], "Prints test") #Test @see parsing self.assertEqual(attrs['seealso'], ["element2.py", "unknown_element.py"]) def test_get_format_elements(self): """bibformat - multiple format elements parsing and returned structure""" bibformat_engine.CFG_BIBFORMAT_ELEMENTS_PATH = CFG_BIBFORMAT_ELEMENTS_PATH bibformat_engine.CFG_BIBFORMAT_ELEMENTS_IMPORT_PATH = CFG_BIBFORMAT_ELEMENTS_IMPORT_PATH elements = bibformat_engine.get_format_elements() self.assert_(isinstance(elements, dict)) self.assertEqual(elements['TEST_1']['attrs']['name'], "TEST_1") self.assertEqual(elements['TEST_2']['attrs']['name'], "TEST_2") self.assert_("TEST_3" not in elements.keys()) self.assert_("TEST_4" not in elements.keys()) def test_get_tags_used_by_element(self): """bibformat - identification of tag usage inside element""" bibformat_engine.CFG_BIBFORMAT_ELEMENTS_PATH = bibformat_config.CFG_BIBFORMAT_ELEMENTS_PATH bibformat_engine.CFG_BIBFORMAT_ELEMENTS_IMPORT_PATH = bibformat_config.CFG_BIBFORMAT_ELEMENTS_IMPORT_PATH tags = bibformatadminlib.get_tags_used_by_element('bfe_abstract.py') self.failUnless(len(tags) == 4, 'Could not correctly identify tags used in bfe_abstract.py') class OutputFormatTest(unittest.TestCase): """ bibformat - tests on output formats""" def test_get_output_format(self): """ bibformat - output format parsing and returned structure """ bibformat_engine.CFG_BIBFORMAT_OUTPUTS_PATH = CFG_BIBFORMAT_OUTPUTS_PATH filename_1 = bibformat_engine.resolve_output_format_filename("test1") output_1 
= bibformat_engine.get_output_format(filename_1, with_attributes=True) self.assertEqual(output_1['attrs']['names']['generic'], "") self.assert_(isinstance(output_1['attrs']['names']['ln'], dict)) self.assert_(isinstance(output_1['attrs']['names']['sn'], dict)) self.assertEqual(output_1['attrs']['code'], "TEST1") self.assert_(len(output_1['attrs']['code']) <= 6) self.assertEqual(len(output_1['rules']), 4) self.assertEqual(output_1['rules'][0]['field'], '980.a') self.assertEqual(output_1['rules'][0]['template'], 'Picture_HTML_detailed.bft') self.assertEqual(output_1['rules'][0]['value'], 'PICTURE ') self.assertEqual(output_1['rules'][1]['field'], '980.a') self.assertEqual(output_1['rules'][1]['template'], 'Article.bft') self.assertEqual(output_1['rules'][1]['value'], 'ARTICLE') self.assertEqual(output_1['rules'][2]['field'], '980__a') self.assertEqual(output_1['rules'][2]['template'], 'Thesis_detailed.bft') self.assertEqual(output_1['rules'][2]['value'], 'THESIS ') self.assertEqual(output_1['rules'][3]['field'], '980__a') self.assertEqual(output_1['rules'][3]['template'], 'Pub.bft') self.assertEqual(output_1['rules'][3]['value'], 'PUBLICATION ') filename_2 = bibformat_engine.resolve_output_format_filename("TEST2") output_2 = bibformat_engine.get_output_format(filename_2, with_attributes=True) self.assertEqual(output_2['attrs']['names']['generic'], "") self.assert_(isinstance(output_2['attrs']['names']['ln'], dict)) self.assert_(isinstance(output_2['attrs']['names']['sn'], dict)) self.assertEqual(output_2['attrs']['code'], "TEST2") self.assert_(len(output_2['attrs']['code']) <= 6) self.assertEqual(output_2['rules'], []) unknown_output = bibformat_engine.get_output_format("unknow", with_attributes=True) self.assertEqual(unknown_output, {'rules':[], 'default':"", 'attrs':{'names':{'generic':"", 'ln':{}, 'sn':{}}, 'description':'', 'code':"UNKNOW", 'visibility': 1, 'content_type':""}}) def test_get_output_formats(self): """ bibformat - loading multiple output formats """ bibformat_engine.CFG_BIBFORMAT_OUTPUTS_PATH = CFG_BIBFORMAT_OUTPUTS_PATH outputs = bibformat_engine.get_output_formats(with_attributes=True) self.assert_(isinstance(outputs, dict)) self.assert_("TEST1.bfo" in outputs.keys()) self.assert_("TEST2.bfo" in outputs.keys()) self.assert_("unknow.bfo" not in outputs.keys()) #Test correct parsing output_1 = outputs["TEST1.bfo"] self.assertEqual(output_1['attrs']['names']['generic'], "") self.assert_(isinstance(output_1['attrs']['names']['ln'], dict)) self.assert_(isinstance(output_1['attrs']['names']['sn'], dict)) self.assertEqual(output_1['attrs']['code'], "TEST1") self.assert_(len(output_1['attrs']['code']) <= 6) def test_get_output_format_attrs(self): """ bibformat - correct parsing of attributes in output format""" bibformat_engine.CFG_BIBFORMAT_OUTPUTS_PATH = CFG_BIBFORMAT_OUTPUTS_PATH attrs= bibformat_engine.get_output_format_attrs("TEST1") self.assertEqual(attrs['names']['generic'], "") self.assert_(isinstance(attrs['names']['ln'], dict)) self.assert_(isinstance(attrs['names']['sn'], dict)) self.assertEqual(attrs['code'], "TEST1") self.assert_(len(attrs['code']) <= 6) def test_resolve_output_format(self): """ bibformat - resolving output format filename""" bibformat_engine.CFG_BIBFORMAT_OUTPUTS_PATH = CFG_BIBFORMAT_OUTPUTS_PATH filenames = ["test1", "test1.bfo", "TEST1", "TeST1", "TEST1.bfo", "test1"] for i in range(len(filenames)-2): filename_1 = bibformat_engine.resolve_output_format_filename(filenames[i]) self.assert_(filename_1 is not None) filename_2 = 
bibformat_engine.resolve_output_format_filename(filenames[i+1]) self.assertEqual(filename_1, filename_2) def test_get_fresh_output_format_filename(self): """ bibformat - getting fresh filename for output format""" bibformat_engine.CFG_BIBFORMAT_OUTPUTS_PATH = CFG_BIBFORMAT_OUTPUTS_PATH filename_and_name_1 = bibformat_engine.get_fresh_output_format_filename("test") self.assert_(len(filename_and_name_1) >= 2) self.assertEqual(filename_and_name_1[0], "TEST.bfo") filename_and_name_1_bis = bibformat_engine.get_fresh_output_format_filename("") self.assert_(len(filename_and_name_1_bis) >= 2) self.assertEqual(filename_and_name_1_bis[0], "TEST.bfo") filename_and_name_2 = bibformat_engine.get_fresh_output_format_filename("test1") self.assert_(len(filename_and_name_2) >= 2) self.assert_(filename_and_name_2[0] != "TEST1.bfo") path = bibformat_engine.CFG_BIBFORMAT_OUTPUTS_PATH + os.sep + filename_and_name_2[0] self.assert_(not os.path.exists(path)) filename_and_name_3 = bibformat_engine.get_fresh_output_format_filename("test1testlong") self.assert_(len(filename_and_name_3) >= 2) self.assert_(filename_and_name_3[0] != "TEST1TESTLONG.bft") self.assert_(len(filename_and_name_3[0]) <= 6 + 1 + len(bibformat_config.CFG_BIBFORMAT_FORMAT_OUTPUT_EXTENSION)) path = bibformat_engine.CFG_BIBFORMAT_OUTPUTS_PATH + os.sep + filename_and_name_3[0] self.assert_(not os.path.exists(path)) class PatternTest(unittest.TestCase): """ bibformat - tests on re patterns""" def test_pattern_lang(self): """ bibformat - correctness of pattern 'pattern_lang'""" text = '''

Here is my test text

Some wordsQuelques motsEinige Wörter garbage Here ends the middle of my test text EnglishFrançaisDeutsch Here ends my test text

''' result = bibformat_engine.pattern_lang.search(text) self.assertEqual(result.group("langs"), "Some wordsQuelques motsEinige Wörter garbage ") text = '''

Here is my test text

''' result = bibformat_engine.pattern_lang.search(text) self.assertEqual(result.group("langs"), "Some wordsQuelques motsEinige Wörter garbage ") def test_ln_pattern(self): """ bibformat - correctness of pattern 'ln_pattern'""" text = "Some wordsQuelques motsEinige Wörter garbage " result = bibformat_engine.ln_pattern.search(text) self.assertEqual(result.group(1), "en") self.assertEqual(result.group(2), "Some words") def test_pattern_format_template_name(self): """ bibformat - correctness of pattern 'pattern_format_template_name'""" text = ''' garbage a name a description on 2 lines

the content of the template

content ''' result = bibformat_engine.pattern_format_template_name.search(text) self.assertEqual(result.group('name'), "a name") def test_pattern_format_template_desc(self): """ bibformat - correctness of pattern 'pattern_format_template_desc'""" text = ''' garbage a name a description on 2 lines

the content of the template

content ''' result = bibformat_engine.pattern_format_template_desc.search(text) self.assertEqual(result.group('desc'), '''a description on 2 lines ''') def test_pattern_tag(self): """ bibformat - correctness of pattern 'pattern_tag'""" text = ''' garbage but part of content a name a description on 2 lines

the content of the template

my content is so nice! ''' result = bibformat_engine.pattern_tag.search(text) self.assertEqual(result.group('function_name'), "tiTLE") self.assertEqual(result.group('params').strip(), '''param1="value1" param2=""''') def test_pattern_function_params(self): """ bibformat - correctness of pattern 'test_pattern_function_params'""" text = ''' param1="" param2="value2" param3="value3" garbage ''' names = ["param1", "param2", "param3"] values = ["", "value2", "value3"] results = bibformat_engine.pattern_format_element_params.finditer(text) #TODO param_i = 0 for match in results: self.assertEqual(match.group('param'), names[param_i]) self.assertEqual(match.group('value'), values [param_i]) param_i += 1 def test_pattern_format_element_params(self): """ bibformat - correctness of pattern 'pattern_format_element_params'""" text = ''' a description for my element some text @param param1 desc1 @param param2 desc2 @see seethis, seethat ''' names = ["param1", "param2"] descriptions = ["desc1", "desc2"] results = bibformat_engine.pattern_format_element_params.finditer(text) #TODO param_i = 0 for match in results: self.assertEqual(match.group('name'), names[param_i]) self.assertEqual(match.group('desc'), descriptions[param_i]) param_i += 1 def test_pattern_format_element_seealso(self): """ bibformat - correctness of pattern 'pattern_format_element_seealso' """ text = ''' a description for my element some text @param param1 desc1 @param param2 desc2 @see seethis, seethat ''' result = bibformat_engine.pattern_format_element_seealso.search(text) self.assertEqual(result.group('see').strip(), 'seethis, seethat') class MiscTest(unittest.TestCase): """ bibformat - tests on various functions""" def test_parse_tag(self): """ bibformat - result of parsing tags""" tags_and_parsed_tags = ['245COc', ['245', 'C', 'O', 'c'], '245C_c', ['245', 'C', '' , 'c'], '245__c', ['245', '' , '' , 'c'], '245__$$c', ['245', '' , '' , 'c'], '245__$c', ['245', '' , '' , 'c'], '245 $c', ['245', '' , '' , 'c'], '245 $$c', ['245', '' , '' , 'c'], '245__.c', ['245', '' , '' , 'c'], '245 .c', ['245', '' , '' , 'c'], '245C_$c', ['245', 'C', '' , 'c'], '245CO$$c', ['245', 'C', 'O', 'c'], '245CO.c', ['245', 'C', 'O', 'c'], '245$c', ['245', '' , '' , 'c'], '245.c', ['245', '' , '' , 'c'], '245$$c', ['245', '' , '' , 'c'], '245__%', ['245', '' , '' , '%'], '245__$$%', ['245', '' , '' , '%'], '245__$%', ['245', '' , '' , '%'], '245 $%', ['245', '' , '' , '%'], '245 $$%', ['245', '' , '' , '%'], '245$%', ['245', '' , '' , '%'], '245.%', ['245', '' , '' , '%'], '245_O.%', ['245', '' , 'O', '%'], '245.%', ['245', '' , '' , '%'], '245$$%', ['245', '' , '' , '%'], '2%5$$a', ['2%5', '' , '' , 'a'], '2%%%%a', ['2%%', '%', '%', 'a'], '2%%__a', ['2%%', '' , '' , 'a'], '2%%a', ['2%%', '' , '' , 'a']] for i in range(0, len(tags_and_parsed_tags), 2): parsed_tag = bibformat_utils.parse_tag(tags_and_parsed_tags[i]) self.assertEqual(parsed_tag, tags_and_parsed_tags[i+1]) class FormatTest(unittest.TestCase): """ bibformat - generic tests on function that do the formatting. 
Main functions""" def setUp(self): # pylint: disable-msg=C0103 """ bibformat - prepare BibRecord objects""" self.xml_text_1 = ''' 33 thesis Doe1, John Doe2, John editor On the foo and bar1 On the foo and bar2 99999 ''' #rec_1 = bibrecord.create_record(self.xml_text_1) self.bfo_1 = bibformat_engine.BibFormatObject(recID=None, ln='fr', xml_record=self.xml_text_1) self.xml_text_2 = ''' 33 thesis Doe1, John Doe2, John editor On the foo and bar1 On the foo and bar2 ''' #self.rec_2 = bibrecord.create_record(xml_text_2) self.bfo_2 = bibformat_engine.BibFormatObject(recID=None, ln='fr', xml_record=self.xml_text_2) self.xml_text_3 = ''' 33 eng Doe1, John Doe2, John editor On the foo and bar1 On the foo and bar2 article ''' #self.rec_3 = bibrecord.create_record(xml_text_3) self.bfo_3 = bibformat_engine.BibFormatObject(recID=None, ln='fr', xml_record=self.xml_text_3) def test_decide_format_template(self): """ bibformat - choice made by function decide_format_template""" bibformat_engine.CFG_BIBFORMAT_OUTPUTS_PATH = CFG_BIBFORMAT_OUTPUTS_PATH result = bibformat_engine.decide_format_template(self.bfo_1, "test1") self.assertEqual(result, "Thesis_detailed.bft") result = bibformat_engine.decide_format_template(self.bfo_3, "test3") self.assertEqual(result, "Test3.bft") #Only default matches result = bibformat_engine.decide_format_template(self.bfo_2, "test1") self.assertEqual(result, "Default_HTML_detailed.bft") #No match at all for record result = bibformat_engine.decide_format_template(self.bfo_2, "test2") self.assertEqual(result, None) #Non existing output format result = bibformat_engine.decide_format_template(self.bfo_2, "UNKNOW") self.assertEqual(result, None) def test_format_record(self): """ bibformat - correct formatting""" bibformat_engine.CFG_BIBFORMAT_OUTPUTS_PATH = CFG_BIBFORMAT_OUTPUTS_PATH bibformat_engine.CFG_BIBFORMAT_ELEMENTS_PATH = CFG_BIBFORMAT_ELEMENTS_PATH bibformat_engine.CFG_BIBFORMAT_ELEMENTS_IMPORT_PATH = CFG_BIBFORMAT_ELEMENTS_IMPORT_PATH bibformat_engine.CFG_BIBFORMAT_TEMPLATES_PATH = CFG_BIBFORMAT_TEMPLATES_PATH #use output format that has no match TEST DISABLED DURING MIGRATION #result = bibformat_engine.format_record(recID=None, of="test2", xml_record=self.xml_text_2) #self.assertEqual(result.replace("\n", ""),"") #use output format that link to unknown template result = bibformat_engine.format_record(recID=None, of="test3", xml_record=self.xml_text_2) self.assertEqual(result.replace("\n", ""),"") #Unknown output format TEST DISABLED DURING MIGRATION #result = bibformat_engine.format_record(recID=None, of="unkno", xml_record=self.xml_text_3) #self.assertEqual(result.replace("\n", ""),"") #Default formatting result = bibformat_engine.format_record(recID=None, ln='fr', of="test3", xml_record=self.xml_text_3) self.assertEqual(result,'''

hi

this is my template\ntesttfrgarbage\n
test me!<b>ok</b>a default valueeditor\n
test me!oka default valueeditor\n
test me!<b>ok</b>a default valueeditor\n''') def test_format_with_format_template(self): """ bibformat - correct formatting with given template""" bibformat_engine.CFG_BIBFORMAT_ELEMENTS_PATH = CFG_BIBFORMAT_ELEMENTS_PATH bibformat_engine.CFG_BIBFORMAT_ELEMENTS_IMPORT_PATH = CFG_BIBFORMAT_ELEMENTS_IMPORT_PATH bibformat_engine.CFG_BIBFORMAT_TEMPLATES_PATH = CFG_BIBFORMAT_TEMPLATES_PATH template = bibformat_engine.get_format_template("Test3.bft") result = bibformat_engine.format_with_format_template(format_template_filename = None, bfo=self.bfo_1, verbose=0, format_template_code=template['code']) self.assert_(isinstance(result, tuple)) self.assertEqual(result[0],'''

hi

this is my template\ntesttfrgarbage\n
test me!<b>ok</b>a default valueeditor\n
test me!oka default valueeditor\n
test me!<b>ok</b>a default valueeditor\n99999''') def create_test_suite(): """Return test suite for the bibformat module""" return unittest.TestSuite((unittest.makeSuite(FormatTemplateTest,'test'), unittest.makeSuite(OutputFormatTest,'test'), unittest.makeSuite(FormatElementTest,'test'), unittest.makeSuite(PatternTest,'test'), unittest.makeSuite(MiscTest,'test'), unittest.makeSuite(FormatTest,'test'))) if __name__ == '__main__': unittest.TextTestRunner(verbosity=2).run(create_test_suite()) diff --git a/modules/bibformat/lib/bibformat_templates.py b/modules/bibformat/lib/bibformat_templates.py index a0a29cdeb..44e2de973 100644 --- a/modules/bibformat/lib/bibformat_templates.py +++ b/modules/bibformat/lib/bibformat_templates.py @@ -1,2301 +1,2301 @@ # -*- coding: utf-8 -*- ## ## $Id$ ## ## This file is part of CDS Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 CERN. ## ## CDS Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## CDS Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with CDS Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. """HTML Templates for BibFormat administration""" __revision__ = "$Id$" # non Invenio imports import cgi # Invenio imports from invenio.messages import gettext_set_language from invenio.config import weburl, sweburl from invenio.messages import language_list_long from invenio.config import CFG_PATH_PHP class Template: """Templating class, refer to bibformat.py for examples of call""" def tmpl_admin_index(self, ln, warnings, is_admin): """ Returns the main BibFormat admin page. @param ln language @param warnings a list of warnings to display at top of page. None if no warning @param is_admin indicate if user is authorized to use BibFormat @return main BibFormat admin page """ _ = gettext_set_language(ln) # load the right message language out = '' if warnings: out += '''
%(warnings)s
''' % {'warnings': '
'.join(warnings)} out += '''

This is where you can edit the formatting styles available for the records. ''' if not is_admin: out += '''You need to login to enter. ''' % {'weburl':weburl} out += '''

Manage Format Templates
Define how to format a record.
Manage Output Formats
Define which template is applied to which record for a given output.
Manage Knowledge Bases
Define mappings of values, for standardizing records or declaring often used values.

Format Elements Documentation
Documentation of the format elements to be used inside format templates.
BibFormat Admin Guide
Documentation about BibFormat administration
'''% {'weburl':weburl, 'ln':ln} if CFG_PATH_PHP: #Show PHP admin only if PHP is enabled out += '''



Old BibFormat admin interface (in gray box)

The BibFormat admin interface enables you to specify how the bibliographic data is presented to the end user in the search interface and search results pages. For example, you may specify that titles should be printed in bold font, the abstract in small italic, etc. Moreover, BibFormat is not only a simple bibliographic data output formatter, but also an automated link constructor. For example, from the information on journal name and pages, it may automatically create links to the publisher's site based on some configuration rules.

Configuring BibFormat

By default, a simple HTML format based on the most common fields (title, author, abstract, keywords, fulltext link, etc.) is defined. You will certainly want to define your own output formats if you have a specific metadata structure.

Here is a short guide of what you can configure:

Behaviours
Define one or more output BibFormat behaviours. These are then passed as parameters to the BibFormat modules while executing formatting.
Example: You can tell BibFormat that it has to enrich the incoming metadata file with the created format, or that it only has to print the format out.
Extraction Rules
Define how the metadata tags from input are mapped into internal BibFormat variable names. The variable names can afterwards be used in formatting and linking rules.
Example: You can tell that the 100 $a field should be mapped into the $100.a internal variable that you can use later.
Link Rules
Define rules for automated creation of URI links from mapped internal variables.
Example: You can tell a rule how to create a link to the People database out of the $100.a internal variable representing the author's name. (The $100.a variable was mapped in the previous step, see the Extraction Rules.)
File Formats
Define file format types based on file extensions. This will be used when proposing various fulltext services.
Example: You can tell that *.pdf files will be treated as PDF files.
User Defined Functions (UDFs)
Define your own functions that you can reuse when creating your own output formats. This enables you to do complex formatting without ever touching the BibFormat core code.
Example: You can define a function that matches and extracts email addresses from a text file.
Formats
Define the output formats, i.e. how to create the output out of internal BibFormat variables that were extracted in a previous step. This is the functionality you would want to configure most of the time. It may reuse formats, user defined functions, knowledge bases, etc.
Example: You can tell that authors should be printed in italic, that if there are more than 10 authors only the first three should be printed, etc.
Knowledge Bases (KBs)
Define one or more knowledge bases that enable you to transform various forms of input data values into a unique standard form on output.
Example: You can tell that Phys Rev D and Physical Review D are both the same journal and that these names should be standardized to Phys Rev : D.
Execution Test
Enables you to test your formats on your sample data file. Useful when debugging newly created formats.

To learn more on BibFormat configuration, you can consult the BibFormat Admin Guide.

Running BibFormat

From the Web interface

Run Reformat Records tool. This tool permits you to update stored formats for bibliographic records.
It should normally be used after configuring BibFormat's Behaviours and Formats. When these are ready, you can choose to rebuild formats for selected collections or you can manually enter a search query and the web interface will accomplish all necessary formatting steps.
Example: You can request Photo collections to have their HTML brief formats rebuilt, or you can reformat all the records written by Ellis.

From the command-line interface

Consider having an XML MARC data file that is to be uploaded into CDS Invenio. (For example, it might have been harvested from other sources and processed via BibConvert.) Having configured BibFormat and its default output type behaviour, you would then run this file through BibFormat as follows:

             $ bibformat < /tmp/sample.xml > /tmp/sample_with_fmt.xml
             
             
that would create default HTML formats and would "enrich" the input XML data file with this format. (You would then continue the upload procedure by calling successively BibUpload and BibIndex.)

Now consider a different situation. You would like to add new possible formats, say "HTML portfolio" and "HTML captions", in order to nicely format multiple photographs on one page. Let us suppose that these two formats are called hp and hc and are already loaded in the collection_format table. (TODO: describe how this is done via WebAdmin.) You would then proceed as follows: firstly, you would prepare the corresponding output behaviours called HP and HC (TODO: note the uppercase!) that would not enrich the input file but would produce an XML file containing only 001 and FMT tags. (This is in order not to update the bibliographic information but only the formats.) You would also prepare the corresponding formats at the same time. Secondly, you would launch the formatting as follows:

             $ bibformat otype=HP,HC < /tmp/sample.xml > /tmp/sample_fmts_only.xml
             
             
that should give you an XML file containing only 001 and FMT tags. Finally, you would upload the formats:
             $ bibupload < /tmp/sample_fmts_only.xml
             
             
and that's it. The new formats should now appear in WebSearch.
''' % {'weburl':weburl, 'ln':ln} return out def tmpl_admin_format_template_show_attributes(self, ln, name, description, filename, editable, all_templates=[], new=False): """ Returns a page to change format template name and description If template is new, offer a way to create a duplicate from an existing template @param ln language @param name the name of the format @param description the description of the format @param filename the filename of the template @param editable True if we let user edit, else False @param all_templates a list of tuples (filename, name) of all other templates @param new if True, the format template has just been added (is new) @return editor for 'format' """ _ = gettext_set_language(ln) # load the right message language out = "" out += '''
%(menu)s
0. %(close_editor)s  1. %(template_editor)s  2. %(modify_template_attributes)s  3. %(check_dependencies)s 

''' % {'ln':ln, 'menu':_("Menu"), 'filename':filename, 'close_editor': _("Close Editor"), 'modify_template_attributes': _("Modify Template Attributes"), 'template_editor': _("Template Editor"), 'check_dependencies': _("Check Dependencies") } disabled = "" readonly = "" if not editable: disabled = 'disabled="disabled"' readonly = 'readonly="readonly"' out += '''
''' % {'ln':ln, 'filename':filename} if new: #Offer the possibility to make a duplicate of existing format template code out += '''
Make a copy of format template: [?]
''' out += ''' ''' % {"name": name, 'ln':ln, 'filename':filename, 'disabled':disabled, 'readonly':readonly, 'name_label': _("Name"), 'weburl':weburl } out += '''
%(name)s attributes [?]
 
''' % {"description": description, 'ln':ln, 'filename':filename, 'disabled':disabled, 'readonly':readonly, 'description_label': _("Description"), 'update_format_attributes': _("Update Format Attributes"), 'weburl':weburl } return out def tmpl_admin_format_template_show_dependencies(self, ln, name, filename, output_formats, format_elements, tags): """ Shows the dependencies (on elements) of the given format. @param name the name of the template @param filename the filename of the template @param format_elements the elements (and list of tags in each element) this template depends on @param output_formats the output format that depend on this template @param tags the tags that are called by format elements this template depends on. """ _ = gettext_set_language(ln) # load the right message language out = '''
%(menu)s
0. %(close_editor)s  1. %(template_editor)s  2. %(modify_template_attributes)s  3. %(check_dependencies)s 
Output Formats that use %(name)s Format Elements used by %(name)s* All Tags Called*
 
''' % {'ln':ln, 'filename':filename, 'menu': _("Menu"), 'close_editor': _("Close Editor"), 'modify_template_attributes': _("Modify Template Attributes"), 'template_editor': _("Template Editor"), 'check_dependencies': _("Check Dependencies"), 'name': name } #Print output formats if len(output_formats) == 0: out += '

No output format uses this format template.

' for output_format in output_formats: name = output_format['names']['generic'] filename = output_format['filename'] out += ''' %(name)s''' % {'filename':filename, 'name':name, 'ln':ln} if len(output_format['tags']) > 0: out += "("+", ".join(output_format['tags'])+")" out += "
" #Print format elements (and tags) out += '
 
' if len(format_elements) == 0: out += '

This format template uses no format element.

' for format_element in format_elements: name = format_element['name'] out += ''' %(name)s''' % {'name':"bfe_"+name.lower(), 'anchor':name.upper(), 'ln':ln} if len(format_element['tags']) > 0: out += "("+", ".join(format_element['tags'])+")" out += "
" #Print tags out += '
 
' if len(tags) == 0: out += '

This format template uses no tag.

' for tag in tags: out += '''%(tag)s
''' % { 'tag':tag} out += '''
*Note: Some tags linked with this format template might not be shown. Check manually. ''' return out def tmpl_admin_format_template_show(self, ln, name, description, code, filename, ln_for_preview, pattern_for_preview, editable, content_type_for_preview, content_types): """ Returns the editor for format templates. Edit 'format' @param ln language @param format the format to edit @param filename the filename of the template @param ln_for_preview the language for the preview (for bfo) @param pattern_for_preview the search pattern to be used for the preview (for bfo) @param editable True if we let user edit, else False @param code the code of the template of the editor @return editor for 'format' """ _ = gettext_set_language(ln) # load the right message language out = "" # If xsl, hide some options in the menu nb_menu_options = 4 if filename.endswith('.xsl'): nb_menu_options = 2 out += ''' ''' % {'ln': ln, 'filename': filename, 'menu': _("Menu"), 'label_show_doc': _("Show Documentation"), 'label_hide_doc': _("Hide Documentation"), 'close_editor': _("Close Editor"), 'modify_template_attributes': _("Modify Template Attributes"), 'template_editor': _("Template Editor"), 'check_dependencies': _("Check Dependencies"), 'nb_menu_options': nb_menu_options, 'weburl': sweburl or weburl } if not filename.endswith('.xsl'): out +=''' ''' % {'ln': ln, 'filename': filename, 'menu': _("Menu"), 'label_show_doc': _("Show Documentation"), 'label_hide_doc': _("Hide Documentation"), 'close_editor': _("Close Editor"), 'modify_template_attributes': _("Modify Template Attributes"), 'template_editor': _("Template Editor"), 'check_dependencies': _("Check Dependencies"), 'weburl': sweburl or weburl } out +='''
%(menu)s
0. %(close_editor)s  1. %(template_editor)s 2. %(modify_template_attributes)s  3. %(check_dependencies)s 
''' % {'ln': ln, 'filename': filename, 'menu': _("Menu"), 'label_show_doc': _("Show Documentation"), 'label_hide_doc': _("Hide Documentation"), 'close_editor': _("Close Editor"), 'modify_template_attributes': _("Modify Template Attributes"), 'template_editor': _("Template Editor"), 'check_dependencies': _("Check Dependencies"), 'weburl': sweburl or weburl } disabled = "" readonly = "" toolbar = """""" % (weburl, ln) if not editable: disabled = 'disabled="disabled"' readonly = 'readonly="readonly"' toolbar = '' #First column: template code and preview out += ''' ''' % {'code':code, 'ln':ln, 'weburl':weburl, 'filename':filename, 'ln_for_preview':ln_for_preview, 'pattern_for_preview':pattern_for_preview } #Second column Print documentation out += '''
Format template code
%(toolbar)s
Preview
   
Elements Documentation
''' % {'weburl':weburl, 'ln':ln} return out def tmpl_admin_format_template_show_short_doc(self, ln, format_elements): """ Prints the format element documentation in a condensed way to display inside format template editor. This page is different from others: it is displayed inside a