diff --git a/INSTALL b/INSTALL index 9ab34edf6..53204197b 100644 --- a/INSTALL +++ b/INSTALL @@ -1,696 +1,694 @@ Invenio INSTALLATION ==================== About ===== This document specifies how to build, customize, and install Invenio v1.0.1 for the first time. See RELEASE-NOTES if you are upgrading from a previous Invenio release. Contents ======== 0. Prerequisites 1. Quick instructions for the impatient Invenio admin 2. Detailed instructions for the patient Invenio admin 0. Prerequisites ================ Here is the software you need to have around before you start installing Invenio: a) Unix-like operating system. The main development and production platforms for Invenio at CERN are GNU/Linux distributions Debian, Gentoo, Scientific Linux (aka RHEL), Ubuntu, but we also develop on Mac OS X. Basically any Unix system supporting the software listed below should do. If you are using Debian GNU/Linux ``Lenny'' or later, then you can install most of the below-mentioned prerequisites and recommendations by running: $ sudo aptitude install python-dev apache2-mpm-prefork \ mysql-server mysql-client python-mysqldb \ python-4suite-xml python-simplejson python-xml \ python-libxml2 python-libxslt1 gnuplot poppler-utils \ gs-common clisp gettext libapache2-mod-wsgi unzip \ python-dateutil python-rdflib \ python-gnuplot python-magic pdftk html2text giflib-tools \ pstotext netpbm You may also want to install some of the following packages, if you have them available on your concrete architecture: - $ sudo aptitude install python-psyco sbcl cmucl pylint \ - pychecker pyflakes python-profiler python-epydoc \ - libapache2-mod-xsendfile openoffice.org + $ sudo aptitude install sbcl cmucl pylint pychecker pyflakes \ + python-profiler python-epydoc libapache2-mod-xsendfile \ + openoffice.org Moreover, you should install some Message Transfer Agent (MTA) such as Postfix so that Invenio can email notification alerts or registration information to the end users, contact moderators and reviewers of submitted documents, inform administrators about various runtime system information, etc: $ sudo aptitude install postfix After running the above-quoted aptitude command(s), you can proceed to configuring your MySQL server instance (max_allowed_packet in my.cnf, see item 0b below) and then to installing the Invenio software package in the section 1 below. If you are using another operating system, then please continue reading the rest of this prerequisites section, and please consult our wiki pages for any concrete hints for your specific operating system. b) MySQL server (may be on a remote machine), and MySQL client (must be available locally too). MySQL versions 4.1 or 5.0 are supported. Please set the variable "max_allowed_packet" in your "my.cnf" init file to at least 4M. (For sites such as INSPIRE, having 1M records with 10M citer-citee pairs in its citation map, you may need to increase max_allowed_packet to 1G.) You may perhaps also want to run your MySQL server natively in UTF-8 mode by setting "default-character-set=utf8" in various parts of your "my.cnf" file, such as in the "[mysql]" part and elsewhere; but this is not really required. c) Apache 2 server, with support for loading DSO modules, and optionally with SSL support for HTTPS-secure user authentication, and mod_xsendfile for off-loading file downloads away from Invenio processes to Apache. 
d) Python v2.4 or above: as well as the following Python modules: - (mandatory) MySQLdb (version >= 1.2.1_p2; see below) - (recommended) python-dateutil, for complex date processing: - (recommended) PyXML, for XML processing: - (recommended) PyRXP, for very fast XML MARC processing: - (recommended) libxml2-python, for XML/XLST processing: - (recommended) simplejson, for AJAX apps: Note that if you are using Python-2.6, you don't need to install simplejson, because the module is already included in the main Python distribution. - (recommended) Gnuplot.Py, for producing graphs: - (recommended) Snowball Stemmer, for stemming: - (recommended) py-editdist, for record merging: - (recommended) numpy, for citerank methods: - (recommended) magic, for full-text file handling: - (optional) 4suite, slower alternative to PyRXP and libxml2-python: - (optional) feedparser, for web journal creation: - - (optional) Psyco, if you are running on a 32-bit OS: - - (optional) RDFLib, to use RDF ontologies and thesauri: - (optional) mechanize, to run regression web test suite: - (optional) hashlib, needed only for Python-2.4 and only if you would like to use AWS connectivity: Note: MySQLdb version 1.2.1_p2 or higher is recommended. If you are using an older version of MySQLdb, you may get into problems with character encoding. e) mod_wsgi Apache module. Versions 3.x and above are recommended. Note: if you are using Python 2.4 or earlier, then you should also install the wsgiref Python module, available from: (As of Python 2.5 this module is included in standard Python distribution.) f) If you want to be able to extract references from PDF fulltext files, then you need to install pdftotext version 3 at least. g) If you want to be able to search for words in the fulltext files (i.e. to have fulltext indexing) or to stamp submitted files, then you need as well to install some of the following tools: - for Microsoft Office/OpenOffice.org document conversion: OpenOffice.org - for PDF file stamping: pdftk, pdf2ps - for PDF files: pdftotext or pstotext - for PostScript files: pstotext or ps2ascii - for DjVu creation, elaboration: DjVuLibre - to perform OCR: OCRopus (tested only with release 0.3.1) - to perform different image elaborations: ImageMagick - to generate PDF after OCR: ReportLab - to analyze images to generate PDF after OCR: netpbm h) If you have chosen to install fast XML MARC Python processors in the step d) above, then you have to install the parsers themselves: - (optional) 4suite: i) (recommended) Gnuplot, the command-line driven interactive plotting program. It is used to display download and citation history graphs on the Detailed record pages on the web interface. Note that Gnuplot must be compiled with PNG output support, that is, with the GD library. Note also that Gnuplot is not required, only recommended. j) (recommended) A Common Lisp implementation, such as CLISP, SBCL or CMUCL. It is used for the web server log analysing tool and the metadata checking program. Note that any of the three implementations CLISP, SBCL, or CMUCL will do. CMUCL produces fastest machine code, but it does not support UTF-8 yet. Pick up CLISP if you don't know what to do. Note that a Common Lisp implementation is not required, only recommended. k) GNU gettext, a set of tools that makes it possible to translate the application in multiple languages. This is available by default on many systems. 
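   To illustrate item 0b above, here is a minimal my.cnf sketch with
   the recommended MySQL settings. (The values shown are illustrative
   assumptions only; tune them to the size of your site, and recall
   that very large citation maps may need a much bigger
   max_allowed_packet.)

      [mysqld]
      max_allowed_packet=4M
      default-character-set=utf8

      [mysql]
      default-character-set=utf8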
Note that the configure script checks whether you have all the prerequisite software installed and that it won't let you continue unless everything is in order. It also warns you if it cannot find some optional but recommended software. 1. Quick instructions for the impatient Invenio admin ========================================================= 1a. Installation ---------------- $ cd $HOME/src/ $ wget http://invenio-software.org/download/invenio-1.0.1.tar.gz $ wget http://invenio-software.org/download/invenio-1.0.1.tar.gz.md5 $ wget http://invenio-software.org/download/invenio-1.0.1.tar.gz.sig $ md5sum -c invenio-1.0.1.tar.gz.md5 $ gpg --verify invenio-1.0.1.tar.gz.sig invenio-1.0.1.tar.gz $ tar xvfz invenio-1.0.1.tar.gz $ cd invenio-1.0.1 $ ./configure $ make $ make install $ make install-mathjax-plugin ## optional $ make install-jquery-plugins ## optional $ make install-fckeditor-plugin ## optional $ make install-pdfa-helper-files ## optional 1b. Configuration ----------------- $ sudo chown -R www-data.www-data /opt/invenio $ sudo -u www-data emacs /opt/invenio/etc/invenio-local.conf $ sudo -u www-data /opt/invenio/bin/inveniocfg --update-all $ sudo -u www-data /opt/invenio/bin/inveniocfg --create-tables $ sudo -u www-data /opt/invenio/bin/inveniocfg --load-webstat-conf $ sudo -u www-data /opt/invenio/bin/inveniocfg --create-apache-conf $ sudo /etc/init.d/apache2 restart $ sudo -u www-data /opt/invenio/bin/inveniocfg --create-demo-site $ sudo -u www-data /opt/invenio/bin/inveniocfg --load-demo-records $ sudo -u www-data /opt/invenio/bin/inveniocfg --run-unit-tests $ sudo -u www-data /opt/invenio/bin/inveniocfg --run-regression-tests $ sudo -u www-data /opt/invenio/bin/inveniocfg --run-web-tests $ sudo -u www-data /opt/invenio/bin/inveniocfg --remove-demo-records $ sudo -u www-data /opt/invenio/bin/inveniocfg --drop-demo-site $ firefox http://your.site.com/help/admin/howto-run 2. Detailed instructions for the patient Invenio admin ========================================================== 2a. Installation ---------------- The Invenio uses standard GNU autoconf method to build and install its files. This means that you proceed as follows: $ cd $HOME/src/ Change to a directory where we will build the Invenio sources. (The built files will be installed into different "target" directories later.) $ wget http://invenio-software.org/download/invenio-1.0.1.tar.gz $ wget http://invenio-software.org/download/invenio-1.0.1.tar.gz.md5 $ wget http://invenio-software.org/download/invenio-1.0.1.tar.gz.sig Fetch Invenio source tarball from the distribution server, together with MD5 checksum and GnuPG cryptographic signature files useful for verifying the integrity of the tarball. $ md5sum -c invenio-1.0.1.tar.gz.md5 Verify MD5 checksum. $ gpg --verify invenio-1.0.1.tar.gz.sig invenio-1.0.1.tar.gz Verify GnuPG cryptographic signature. Note that you may first have to import my public key into your keyring, if you haven't done that already: $ gpg --keyserver wwwkeys.eu.pgp.net --recv-keys 0xBA5A2B67 The output of the gpg --verify command should then read: Good signature from "Tibor Simko " You can safely ignore any trusted signature certification warning that may follow after the signature has been successfully verified. $ tar xvfz invenio-1.0.1.tar.gz Untar the distribution tarball. $ cd invenio-1.0.1 Go to the source directory. $ ./configure Configure Invenio software for building on this specific platform. 
You can use the following optional parameters: --prefix=/opt/invenio Optionally, specify the Invenio general installation directory (default is /opt/invenio). It will contain command-line binaries and program libraries containing the core Invenio functionality, but also store web pages, runtime log and cache information, document data files, etc. Several subdirs like `bin', `etc', `lib', or `var' will be created inside the prefix directory to this effect. Note that the prefix directory should be chosen outside of the Apache htdocs tree, since only one its subdirectory (prefix/var/www) is to be accessible directly via the Web (see below). Note that Invenio won't install to any other directory but to the prefix mentioned in this configuration line. --with-python=/opt/python/bin/python2.4 Optionally, specify a path to some specific Python binary. This is useful if you have more than one Python installation on your system. If you don't set this option, then the first Python that will be found in your PATH will be chosen for running Invenio. --with-mysql=/opt/mysql/bin/mysql Optionally, specify a path to some specific MySQL client binary. This is useful if you have more than one MySQL installation on your system. If you don't set this option, then the first MySQL client executable that will be found in your PATH will be chosen for running Invenio. --with-clisp=/opt/clisp/bin/clisp Optionally, specify a path to CLISP executable. This is useful if you have more than one CLISP installation on your system. If you don't set this option, then the first executable that will be found in your PATH will be chosen for running Invenio. --with-cmucl=/opt/cmucl/bin/lisp Optionally, specify a path to CMUCL executable. This is useful if you have more than one CMUCL installation on your system. If you don't set this option, then the first executable that will be found in your PATH will be chosen for running Invenio. --with-sbcl=/opt/sbcl/bin/sbcl Optionally, specify a path to SBCL executable. This is useful if you have more than one SBCL installation on your system. If you don't set this option, then the first executable that will be found in your PATH will be chosen for running Invenio. --with-openoffice-python Optionally, specify the path to the Python interpreter embedded with OpenOffice.org. This is normally not contained in the normal path. If you don't specify this it won't be possible to use OpenOffice.org to convert from and to Microsoft Office and OpenOffice.org documents. This configuration step is mandatory. Usually, you do this step only once. (Note that if you are building Invenio not from a released tarball, but from the Git sources, then you have to generate the configure file via autotools: $ sudo aptitude install automake1.9 autoconf $ aclocal-1.9 $ automake-1.9 -a $ autoconf after which you proceed with the usual configure command.) $ make Launch the Invenio build. Since many messages are printed during the build process, you may want to run it in a fast-scrolling terminal such as rxvt or in a detached screen session. During this step all the pages and scripts will be pre-created and customized based on the config you have edited in the previous step. Note that on systems such as FreeBSD or Mac OS X you have to use GNU make ("gmake") instead of "make". $ make install Install the web pages, scripts, utilities and everything needed for Invenio runtime into respective installation directories, as specified earlier by the configure command. 
Note that if you are installing Invenio for the first time, you will be asked to create symbolic link(s) from Python's site-packages system-wide directory(ies) to the installation location. This is in order to instruct Python where to find Invenio's Python files. You will be hinted as to the exact command to use based on the parameters you have used in the configure command. $ make install-mathjax-plugin ## optional This will automatically download and install in the proper place MathJax, a JavaScript library to render LaTeX formulas in the client browser. Note that in order to enable the rendering you will have to set the variable CFG_WEBSEARCH_USE_MATHJAX_FOR_FORMATS in invenio-local.conf to a suitable list of output format codes. For example: CFG_WEBSEARCH_USE_MATHJAX_FOR_FORMATS = hd,hb $ make install-jquery-plugins ## optional This will automatically download and install in the proper place jQuery and related plugins. They are used for AJAX applications such as the record editor. Note that `unzip' is needed when installing jquery plugins. $ make install-fckeditor-plugin ## optional This will automatically download and install in the proper place FCKeditor, a WYSIWYG Javascript-based editor (e.g. for the WebComment module). Note that in order to enable the editor you have to set the CFG_WEBCOMMENT_USE_FCKEDITOR to True. $ make install-pdfa-helper-files ## optional This will automatically download and install in the proper place the helper files needed to create PDF/A files out of existing PDF files. 2b. Configuration ----------------- Once the basic software installation is done, we proceed to configuring your Invenio system. $ sudo chown -R www-data.www-data /opt/invenio For the sake of simplicity, let us assume that your Invenio installation will run under the `www-data' user process identity. The above command changes ownership of installed files to www-data, so that we shall run everything under this user identity from now on. For production purposes, you would typically enable Apache server to read all files from the installation place but to write only to the `var' subdirectory of your installation place. You could achieve this by configuring Unix directory group permissions, for example. $ sudo -u www-data emacs /opt/invenio/etc/invenio-local.conf Customize your Invenio installation. Please read the 'invenio.conf' file located in the same directory that contains the vanilla default configuration parameters of your Invenio installation. If you want to customize some of these parameters, you should create a file named 'invenio-local.conf' in the same directory where 'invenio.conf' lives and you should write there only the customizations that you want to be different from the vanilla defaults. 
   Here is a realistic, minimalist, yet production-ready example of
   what you would typically put there:

      $ cat /opt/invenio/etc/invenio-local.conf
      [Invenio]
      CFG_SITE_NAME = John Doe's Document Server
      CFG_SITE_NAME_INTL_fr = Serveur des Documents de John Doe
      CFG_SITE_URL = http://your.site.com
      CFG_SITE_SECURE_URL = https://your.site.com
      CFG_SITE_ADMIN_EMAIL = john.doe@your.site.com
      CFG_SITE_SUPPORT_EMAIL = john.doe@your.site.com
      CFG_WEBALERT_ALERT_ENGINE_EMAIL = john.doe@your.site.com
      CFG_WEBCOMMENT_ALERT_ENGINE_EMAIL = john.doe@your.site.com
      CFG_WEBCOMMENT_DEFAULT_MODERATOR = john.doe@your.site.com
      CFG_DATABASE_HOST = localhost
      CFG_DATABASE_NAME = invenio
      CFG_DATABASE_USER = invenio
      CFG_DATABASE_PASS = my123p$ss

   You should override at least the parameters mentioned above in
   order to define some essential runtime parameters such as the name
   of your document server (CFG_SITE_NAME and CFG_SITE_NAME_INTL_*),
   the visible URL of your document server (CFG_SITE_URL and
   CFG_SITE_SECURE_URL), the email addresses of the local Invenio
   administrator, comment moderator, and alert engine
   (CFG_SITE_SUPPORT_EMAIL, CFG_SITE_ADMIN_EMAIL, etc), and last but
   not least your database credentials (CFG_DATABASE_*).

   The Invenio system will then read both the default invenio.conf
   file and your customized invenio-local.conf file, overriding any
   default options with the ones you have specified in your local
   file.  This cascading of configuration parameters will ease your
   future upgrades.

$ sudo -u www-data /opt/invenio/bin/inveniocfg --update-all

   Make the rest of the Invenio system aware of your
   invenio-local.conf changes.  This step is mandatory each time you
   edit your conf files.

$ sudo -u www-data /opt/invenio/bin/inveniocfg --create-tables

   If you are installing Invenio for the first time, you have to
   create database tables.

   Note that this step checks for potential problems such as database
   connection rights, and may ask you to perform some additional
   administrative steps if it detects a problem.  Notably, it may ask
   you to set up database access permissions based on your configure
   values.

   If you are installing Invenio for the first time, you have to
   create a dedicated database on your MySQL server that Invenio can
   use for its purposes.  Please contact your MySQL administrator and
   ask them to execute the commands this step proposes.

   At this point you should have successfully completed the "make
   install" process.  We continue by setting up the Apache web
   server.

$ sudo -u www-data /opt/invenio/bin/inveniocfg --load-webstat-conf

   Load the configuration file of the webstat module.  It will create
   the database tables for registering custom events, such as basket
   hits.

$ sudo -u www-data /opt/invenio/bin/inveniocfg --create-apache-conf

   Running this command will generate Apache virtual host
   configurations matching your installation.  You will be instructed
   to check the created files (usually located under
   /opt/invenio/etc/apache/) and to edit your httpd.conf to activate
   the Invenio virtual hosts.
If you are using Debian GNU/Linux ``Lenny'' or later, then you can do the following to create your SSL certificate and to activate your Invenio vhosts: ## make SSL certificate: $ sudo aptitude install ssl-cert $ sudo mkdir /etc/apache2/ssl $ sudo /usr/sbin/make-ssl-cert /usr/share/ssl-cert/ssleay.cnf \ /etc/apache2/ssl/apache.pem ## add Invenio web sites: $ sudo ln -s /opt/invenio/etc/apache/invenio-apache-vhost.conf \ /etc/apache2/sites-available/invenio $ sudo ln -s /opt/invenio/etc/apache/invenio-apache-vhost-ssl.conf \ /etc/apache2/sites-available/invenio-ssl ## disable Debian's default web site: $ sudo /usr/sbin/a2dissite default ## enable Invenio web sites: $ sudo /usr/sbin/a2ensite invenio $ sudo /usr/sbin/a2ensite invenio-ssl ## enable SSL module: $ sudo /usr/sbin/a2enmod ssl ## if you are using xsendfile module, enable it too: $ sudo /usr/sbin/a2enmod xsendfile If you are using another operating system, you should do the equivalent, for example edit your system-wide httpd.conf and put the following include statements: Include /opt/invenio/etc/apache/invenio-apache-vhost.conf Include /opt/invenio/etc/apache/invenio-apache-vhost-ssl.conf Note that you may need to adapt generated vhost file snippets to match your concrete operating system specifics. For example, the generated configuration snippet will preload Invenio WSGI daemon application upon Apache start up for faster site response. The generated configuration assumes that you are using mod_wsgi version 3 or later. If you are using the old legacy mod_wsgi version 2, then you would need to comment out the WSGIImportScript directive from the generated snippet, or else move the WSGI daemon setup to the top level, outside of the VirtualHost section. Note also that you may want to tweak the generated Apache vhost snippet for performance reasons, especially with respect to WSGIDaemonProcess parameters. For example, you can increase the number of processes from the default value `processes=5' if you have lots of RAM and if many concurrent users may access your site in parallel. However, note that you must use `threads=1' there, because Invenio WSGI daemon processes are not fully thread safe yet. This may change in the future. $ sudo /etc/init.d/apache2 restart Please ask your webserver administrator to restart the Apache server after the above "httpd.conf" changes. $ sudo -u www-data /opt/invenio/bin/inveniocfg --create-demo-site This step is recommended to test your local Invenio installation. It should give you our "Atlantis Institute of Science" demo installation, exactly as you see it at . $ sudo -u www-data /opt/invenio/bin/inveniocfg --load-demo-records Optionally, load some demo records to be able to test indexing and searching of your local Invenio demo installation. $ sudo -u www-data /opt/invenio/bin/inveniocfg --run-unit-tests Optionally, you can run the unit test suite to verify the unit behaviour of your local Invenio installation. Note that this command should be run only after you have installed the whole system via `make install'. $ sudo -u www-data /opt/invenio/bin/inveniocfg --run-regression-tests Optionally, you can run the full regression test suite to verify the functional behaviour of your local Invenio installation. Note that this command requires to have created the demo site and loaded the demo records. Note also that running the regression test suite may alter the database content with junk data, so that rebuilding the demo site is strongly recommended afterwards. 
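   Returning to the WSGIDaemonProcess tuning mentioned above: as an
   illustrative sketch only (the daemon group name and the process
   count below are assumptions; keep whatever values your generated
   vhost snippet actually contains, apart from the one you mean to
   change), raising the number of processes on a well-provisioned
   machine could look like this:

      WSGIDaemonProcess invenio processes=10 threads=1

   Remember that threads=1 must be kept as explained above, because
   the Invenio WSGI daemon processes are not fully thread safe yet.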
$ sudo -u www-data /opt/invenio/bin/inveniocfg --run-web-tests Optionally, you can run additional automated web tests running in a real browser. This requires to have Firefox with the Selenium IDE extension installed. $ sudo -u www-data /opt/invenio/bin/inveniocfg --remove-demo-records Optionally, remove the demo records loaded in the previous step, but keeping otherwise the demo collection, submission, format, and other configurations that you may reuse and modify for your own production purposes. $ sudo -u www-data /opt/invenio/bin/inveniocfg --drop-demo-site Optionally, drop also all the demo configuration so that you'll end up with a completely blank Invenio system. However, you may want to find it more practical not to drop the demo site configuration but to start customizing from there. $ firefox http://your.site.com/help/admin/howto-run In order to start using your Invenio installation, you can start indexing, formatting and other daemons as indicated in the "HOWTO Run" guide on the above URL. You can also use the Admin Area web interfaces to perform further runtime configurations such as the definition of data collections, document types, document formats, word indexes, etc. $ sudo ln -s /opt/invenio/etc/bash_completion.d/inveniocfg \ /etc/bash_completion.d/inveniocfg Optionally, if you are using Bash shell completion, then you may want to create the above symlink in order to configure completion for the inveniocfg command. Good luck, and thanks for choosing Invenio. - Invenio Development Team diff --git a/configure-tests.py b/configure-tests.py index fb8c813b6..87419e3aa 100644 --- a/configure-tests.py +++ b/configure-tests.py @@ -1,368 +1,343 @@ ## This file is part of Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010 CERN. ## ## Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. """ Test the suitability of Python core and the availability of various Python modules for running Invenio. Warn the user if there are eventual troubles. Exit status: 0 if okay, 1 if not okay. Useful for running from configure.ac. """ ## minimally recommended/required versions: cfg_min_python_version = "2.4" cfg_max_python_version = "2.9.9999" cfg_min_mysqldb_version = "1.2.1_p2" ## 0) import modules needed for this testing: import string import sys import getpass def wait_for_user(msg): """Print MSG and prompt user for confirmation.""" try: raw_input(msg) except KeyboardInterrupt: print "\n\nInstallation aborted." sys.exit(1) except EOFError: print " (continuing in batch mode)" return ## 1) check Python version: if sys.version < cfg_min_python_version: print """ ******************************************************* ** ERROR: TOO OLD PYTHON DETECTED: %s ******************************************************* ** You seem to be using a too old version of Python. ** ** You must use at least Python %s. 
** ** ** ** Note that if you have more than one Python ** ** installed on your system, you can specify the ** ** --with-python configuration option to choose ** ** a specific (e.g. non system wide) Python binary. ** ** ** ** Please upgrade your Python before continuing. ** ******************************************************* """ % (string.replace(sys.version, "\n", ""), cfg_min_python_version) sys.exit(1) if sys.version > cfg_max_python_version: print """ ******************************************************* ** ERROR: TOO NEW PYTHON DETECTED: %s ******************************************************* ** You seem to be using a too new version of Python. ** ** You must use at most Python %s. ** ** ** ** Perhaps you have downloaded and are installing an ** ** old Invenio version? Please look for more recent ** ** Invenio version or please contact the development ** ** team at about this ** ** problem. ** ** ** ** Installation aborted. ** ******************************************************* """ % (string.replace(sys.version, "\n", ""), cfg_max_python_version) sys.exit(1) ## 2) check for required modules: try: import MySQLdb import base64 import cPickle import cStringIO import cgi import copy import fileinput import getopt import sys if sys.hexversion < 0x2060000: import md5 else: import hashlib import marshal import os import signal import tempfile import time import traceback import unicodedata import urllib import zlib import wsgiref except ImportError, msg: print """ ************************************************* ** IMPORT ERROR %s ************************************************* ** Perhaps you forgot to install some of the ** ** prerequisite Python modules? Please look ** ** at our INSTALL file for more details and ** ** fix the problem before continuing! ** ************************************************* """ % msg sys.exit(1) ## 3) check for recommended modules: -try: - if (2**31 - 1) == sys.maxint: - # check for Psyco since we seem to run in 32-bit environment - import psyco - else: - # no need to advise on Psyco on 64-bit systems - pass -except ImportError, msg: - print """ - ***************************************************** - ** IMPORT WARNING %s - ***************************************************** - ** Note that Psyco is not really required but we ** - ** recommend it for faster Invenio operation ** - ** if you are running in 32-bit operating system. ** - ** ** - ** You can safely continue installing Invenio ** - ** now, and add this module anytime later. (I.e. ** - ** even after your Invenio installation is put ** - ** into production.) ** - ***************************************************** - """ % msg - - wait_for_user("Press ENTER to continue the installation...") - try: import rdflib except ImportError, msg: print """ ***************************************************** ** IMPORT WARNING %s ***************************************************** ** Note that rdflib is needed only if you plan ** ** to work with the automatic classification of ** ** documents based on RDF-based taxonomies. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) 
** ***************************************************** """ % msg wait_for_user("Press ENTER to continue the installation...") try: import pyRXP except ImportError, msg: print """ ***************************************************** ** IMPORT WARNING %s ***************************************************** ** Note that PyRXP is not really required but ** ** we recommend it for fast XML MARC parsing. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) ** ***************************************************** """ % msg wait_for_user("Press ENTER to continue the installation...") try: import dateutil except ImportError, msg: print """ ***************************************************** ** IMPORT WARNING %s ***************************************************** ** Note that dateutil is not really required but ** ** we recommend it for user-friendly date ** ** parsing. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) ** ***************************************************** """ % msg wait_for_user("Press ENTER to continue the installation...") try: import libxml2 except ImportError, msg: print """ ***************************************************** ** IMPORT WARNING %s ***************************************************** ** Note that libxml2 is not really required but ** ** we recommend it for XML metadata conversions ** ** and for fast XML parsing. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) ** ***************************************************** """ % msg wait_for_user("Press ENTER to continue the installation...") try: import libxslt except ImportError, msg: print """ ***************************************************** ** IMPORT WARNING %s ***************************************************** ** Note that libxslt is not really required but ** ** we recommend it for XML metadata conversions. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) ** ***************************************************** """ % msg wait_for_user("Press ENTER to continue the installation...") try: import Gnuplot except ImportError, msg: print """ ***************************************************** ** IMPORT WARNING %s ***************************************************** ** Note that Gnuplot.py is not really required but ** ** we recommend it in order to have nice download ** ** and citation history graphs on Detailed record ** ** pages. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) 
** ***************************************************** """ % msg wait_for_user("Press ENTER to continue the installation...") try: import magic if not hasattr(magic, "open"): raise StandardError except ImportError, msg: print """ ***************************************************** ** IMPORT WARNING %s ***************************************************** ** Note that magic module is not really required ** ** but we recommend it in order to have detailed ** ** content information about fulltext files. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) ** ***************************************************** """ % msg except StandardError: print """ ***************************************************** ** IMPORT WARNING python-magic ***************************************************** ** The python-magic package you installed is not ** ** the one supported by Invenio. Please refer to ** ** the INSTALL file for more details. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) ** ***************************************************** """ try: import reportlab except ImportError, msg: print """ ***************************************************** ** IMPORT WARNING %s ***************************************************** ** Note that reportlab module is not really ** ** required, but we recommend it you want to ** ** enrich PDF with OCR information. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) ** ***************************************************** """ % msg wait_for_user("Press ENTER to continue the installation...") ## 4) check for versions of some important modules: if MySQLdb.__version__ < cfg_min_mysqldb_version: print """ ***************************************************** ** ERROR: PYTHON MODULE MYSQLDB %s DETECTED ***************************************************** ** You have to upgrade your MySQLdb to at least ** ** version %s. You must fix this problem ** ** before continuing. Please see the INSTALL file ** ** for more details. ** ***************************************************** """ % (MySQLdb.__version__, cfg_min_mysqldb_version) sys.exit(1) try: import Stemmer try: from Stemmer import algorithms except ImportError, msg: print """ ***************************************************** ** ERROR: STEMMER MODULE PROBLEM %s ***************************************************** ** Perhaps you are using an old Stemmer version? ** ** You must either remove your old Stemmer or else ** ** upgrade to Snowball Stemmer ** ** before continuing. Please see the INSTALL file ** ** for more details. 
** ***************************************************** """ % (msg) sys.exit(1) except ImportError: pass # no prob, Stemmer is optional ## 5) check for Python.h (needed for intbitset): try: from distutils.sysconfig import get_python_inc path_to_python_h = get_python_inc() + os.sep + 'Python.h' if not os.path.exists(path_to_python_h): raise StandardError, "Cannot find %s" % path_to_python_h except StandardError, msg: print """ ***************************************************** ** ERROR: PYTHON HEADER FILE ERROR %s ***************************************************** ** You do not seem to have Python developer files ** ** installed (such as Python.h). Some operating ** ** systems provide these in a separate Python ** ** package called python-dev or python-devel. ** ** You must install such a package before ** ** continuing the installation process. ** ***************************************************** """ % (msg) sys.exit(1) diff --git a/modules/bibedit/lib/bibrecord.py b/modules/bibedit/lib/bibrecord.py index 20dc3463f..f3080d958 100644 --- a/modules/bibedit/lib/bibrecord.py +++ b/modules/bibedit/lib/bibrecord.py @@ -1,1540 +1,1523 @@ # -*- coding: utf-8 -*- ## ## This file is part of Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010 CERN. ## ## Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. """BibRecord - XML MARC processing library for Invenio. For API, see create_record(), record_get_field_instances() and friends in the source code of this file in the section entitled INTERFACE. Note: Does not access the database, the input is MARCXML only.""" ### IMPORT INTERESTING MODULES AND XML PARSERS import re import sys -try: - import psyco - PSYCO_AVAILABLE = True -except ImportError: - PSYCO_AVAILABLE = False if sys.hexversion < 0x2040000: # pylint: disable=W0622 from sets import Set as set # pylint: enable=W0622 from invenio.bibrecord_config import CFG_MARC21_DTD, \ CFG_BIBRECORD_WARNING_MSGS, CFG_BIBRECORD_DEFAULT_VERBOSE_LEVEL, \ CFG_BIBRECORD_DEFAULT_CORRECT, CFG_BIBRECORD_PARSERS_AVAILABLE, \ InvenioBibRecordParserError, InvenioBibRecordFieldError from invenio.config import CFG_BIBUPLOAD_EXTERNAL_OAIID_TAG from invenio.textutils import encode_for_xml # Some values used for the RXP parsing. TAG, ATTRS, CHILDREN = 0, 1, 2 # Find out about the best usable parser: AVAILABLE_PARSERS = [] # Do we remove singletons (empty tags)? # NOTE: this is currently set to True as there are some external workflow # exploiting singletons, e.g. bibupload -c used to delete fields, and # bibdocfile --fix-marc called on a record where the latest document # has been deleted. 
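# For example, with singletons kept, an empty datafield such as
# <datafield tag="980" ind1=" " ind2=" "></datafield> coming from one of
# those workflows is kept in the parsed record instead of being silently
# dropped.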
CFG_BIBRECORD_KEEP_SINGLETONS = True try: import pyRXP if 'pyrxp' in CFG_BIBRECORD_PARSERS_AVAILABLE: AVAILABLE_PARSERS.append('pyrxp') except ImportError: pass try: import Ft.Xml.Domlette if '4suite' in CFG_BIBRECORD_PARSERS_AVAILABLE: AVAILABLE_PARSERS.append('4suite') except ImportError: pass try: import xml.dom.minidom import xml.parsers.expat if 'minidom' in CFG_BIBRECORD_PARSERS_AVAILABLE: AVAILABLE_PARSERS.append('minidom') except ImportError: pass ### INTERFACE / VISIBLE FUNCTIONS def create_field(subfields=None, ind1=' ', ind2=' ', controlfield_value='', global_position=-1): """ Returns a field created with the provided elements. Global position is set arbitrary to -1.""" if subfields is None: subfields = [] ind1, ind2 = _wash_indicators(ind1, ind2) field = (subfields, ind1, ind2, controlfield_value, global_position) _check_field_validity(field) return field def create_records(marcxml, verbose=CFG_BIBRECORD_DEFAULT_VERBOSE_LEVEL, correct=CFG_BIBRECORD_DEFAULT_CORRECT, parser='', keep_singletons=CFG_BIBRECORD_KEEP_SINGLETONS): """Creates a list of records from the marcxml description. Returns a list of objects initiated by the function create_record(). Please see that function's docstring.""" # Use the DOTALL flag to include newlines. regex = re.compile('.*?', re.DOTALL) record_xmls = regex.findall(marcxml) return [create_record(record_xml, verbose=verbose, correct=correct, parser=parser, keep_singletons=keep_singletons) for record_xml in record_xmls] def create_record(marcxml, verbose=CFG_BIBRECORD_DEFAULT_VERBOSE_LEVEL, correct=CFG_BIBRECORD_DEFAULT_CORRECT, parser='', sort_fields_by_indicators=False, keep_singletons=CFG_BIBRECORD_KEEP_SINGLETONS): """Creates a record object from the marcxml description. Uses the best parser available in CFG_BIBRECORD_PARSERS_AVAILABLE or the parser specified. The returned object is a tuple (record, status_code, list_of_errors), where status_code is 0 when there are errors, 1 when no errors. The return record structure is as follows: Record := {tag : [Field]} Field := (Subfields, ind1, ind2, value) Subfields := [(code, value)] For example: ______ |record| ------ __________________________|_______________________________________ |record['001'] |record['909'] |record['520'] | | | | | [list of fields] [list of fields] [list of fields] ... | ______|______________ | |[0] |[0] |[1] | |[0] ___|_____ _____|___ ___|_____ ... ____|____ |Field 001| |Field 909| |Field 909| |Field 520| --------- --------- --------- --------- | _______________|_________________ | | ... |[0] |[1] |[2] | ... ... | | | | [list of subfields] 'C' '4' ___|__________________________________________ | | | ('a', 'value') ('b', 'value for subfield b') ('a', 'value for another a') @param marcxml: an XML string representation of the record to create @param verbose: the level of verbosity: 0 (silent), 1-2 (warnings), 3(strict:stop when errors) @param correct: 1 to enable correction of marcxml syntax. Else 0. @return: a tuple (record, status_code, list_of_errors), where status code is 0 where there are errors, 1 when no errors""" # Select the appropriate parser. 
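    # (When no parser is requested explicitly, _select_parser() is expected
    # to fall back to the best available parser among those registered in
    # AVAILABLE_PARSERS above: pyRXP, 4suite, minidom.)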
parser = _select_parser(parser) try: if parser == 'pyrxp': rec = _create_record_rxp(marcxml, verbose, correct, keep_singletons=keep_singletons) elif parser == '4suite': rec = _create_record_4suite(marcxml, keep_singletons=keep_singletons) elif parser == 'minidom': rec = _create_record_minidom(marcxml, keep_singletons=keep_singletons) except InvenioBibRecordParserError, ex1: return (None, 0, str(ex1)) # _create_record = { # 'pyrxp': _create_record_rxp, # '4suite': _create_record_4suite, # 'minidom': _create_record_minidom, # } # try: # rec = _create_record[parser](marcxml, verbose) # except InvenioBibRecordParserError, ex1: # return (None, 0, str(ex1)) if sort_fields_by_indicators: _record_sort_by_indicators(rec) errs = [] if correct: # Correct the structure of the record. errs = _correct_record(rec) return (rec, int(not errs), errs) def record_get_field_instances(rec, tag="", ind1=" ", ind2=" "): """Returns the list of field instances for the specified tag and indicators of the record (rec). Returns empty list if not found. If tag is empty string, returns all fields Parameters (tag, ind1, ind2) can contain wildcard %. @param rec: a record structure as returned by create_record() @param tag: a 3 characters long string @param ind1: a 1 character long string @param ind2: a 1 character long string @param code: a 1 character long string @return: a list of field tuples (Subfields, ind1, ind2, value, field_position_global) where subfields is list of (code, value)""" if not rec: return [] if not tag: return rec.items() else: out = [] ind1, ind2 = _wash_indicators(ind1, ind2) if '%' in tag: # Wildcard in tag. Check all possible for field_tag in rec: if _tag_matches_pattern(field_tag, tag): for possible_field_instance in rec[field_tag]: if (ind1 in ('%', possible_field_instance[1]) and ind2 in ('%', possible_field_instance[2])): out.append(possible_field_instance) else: # Completely defined tag. Use dict for possible_field_instance in rec.get(tag, []): if (ind1 in ('%', possible_field_instance[1]) and ind2 in ('%', possible_field_instance[2])): out.append(possible_field_instance) return out def record_add_field(rec, tag, ind1=' ', ind2=' ', controlfield_value='', subfields=None, field_position_global=None, field_position_local=None): """ Adds a new field into the record. If field_position_global or field_position_local is specified then this method will insert the new field at the desired position. Otherwise a global field position will be computed in order to insert the field at the best position (first we try to keep the order of the tags and then we insert the field at the end of the fields with the same tag). If both field_position_global and field_position_local are present, then field_position_local takes precedence. @param rec: the record data structure @param tag: the tag of the field to be added @param ind1: the first indicator @param ind2: the second indicator @param controlfield_value: the value of the controlfield @param subfields: the subfields (a list of tuples (code, value)) @param field_position_global: the global field position (record wise) @param field_position_local: the local field position (tag wise) @return: the global field position of the newly inserted field or -1 if the operation failed """ error = validate_record_field_positions_global(rec) if error: # FIXME one should write a message here pass # Clean the parameters. 
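    # Reminder: throughout this module a field is the tuple
    # (subfields, ind1, ind2, controlfield_value, global_position),
    # where subfields is a list of (code, value) pairs -- see create_field().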
if subfields is None: subfields = [] ind1, ind2 = _wash_indicators(ind1, ind2) if controlfield_value and (ind1 != ' ' or ind2 != ' ' or subfields): return -1 # Detect field number to be used for insertion: # Dictionaries for uniqueness. tag_field_positions_global = {}.fromkeys([field[4] for field in rec.get(tag, [])]) all_field_positions_global = {}.fromkeys([field[4] for fields in rec.values() for field in fields]) if field_position_global is None and field_position_local is None: # Let's determine the global field position of the new field. if tag in rec: try: field_position_global = max([field[4] for field in rec[tag]]) \ + 1 except IndexError: if tag_field_positions_global: field_position_global = max(tag_field_positions_global) + 1 elif all_field_positions_global: field_position_global = max(all_field_positions_global) + 1 else: field_position_global = 1 else: if tag in ('FMT', 'FFT'): # Add the new tag to the end of the record. if tag_field_positions_global: field_position_global = max(tag_field_positions_global) + 1 elif all_field_positions_global: field_position_global = max(all_field_positions_global) + 1 else: field_position_global = 1 else: # Insert the tag in an ordered way by selecting the # right global field position. immediate_lower_tag = '000' for rec_tag in rec: if (tag not in ('FMT', 'FFT') and immediate_lower_tag < rec_tag < tag): immediate_lower_tag = rec_tag if immediate_lower_tag == '000': field_position_global = 1 else: field_position_global = rec[immediate_lower_tag][-1][4] + 1 field_position_local = len(rec.get(tag, [])) _shift_field_positions_global(rec, field_position_global, 1) elif field_position_local is not None: if tag in rec: if field_position_local >= len(rec[tag]): field_position_global = rec[tag][-1][4] + 1 else: field_position_global = rec[tag][field_position_local][4] _shift_field_positions_global(rec, field_position_global, 1) else: if all_field_positions_global: field_position_global = max(all_field_positions_global) + 1 else: # Empty record. field_position_global = 1 elif field_position_global is not None: # If the user chose an existing global field position, shift all the # global field positions greater than the input global field position. if tag not in rec: if all_field_positions_global: field_position_global = max(all_field_positions_global) + 1 else: field_position_global = 1 field_position_local = 0 elif field_position_global < min(tag_field_positions_global): field_position_global = min(tag_field_positions_global) _shift_field_positions_global(rec, min(tag_field_positions_global), 1) field_position_local = 0 elif field_position_global > max(tag_field_positions_global): field_position_global = max(tag_field_positions_global) + 1 _shift_field_positions_global(rec, max(tag_field_positions_global) + 1, 1) field_position_local = len(rec.get(tag, [])) else: if field_position_global in tag_field_positions_global: _shift_field_positions_global(rec, field_position_global, 1) field_position_local = 0 for position, field in enumerate(rec[tag]): if field[4] == field_position_global + 1: field_position_local = position # Create the new field. newfield = (subfields, ind1, ind2, str(controlfield_value), field_position_global) rec.setdefault(tag, []).insert(field_position_local, newfield) # Return new field number: return field_position_global def record_has_field(rec, tag): """ Checks if the tag exists in the record. 
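    For example, record_has_field(rec, '001') is True as soon as the
    record contains at least one '001' controlfield.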
    @param rec: the record data structure
    @param tag: the field tag
    @return: a boolean
    """
    return tag in rec

def record_delete_field(rec, tag, ind1=' ', ind2=' ',
                        field_position_global=None,
                        field_position_local=None):
    """
    If global field position is specified, deletes the field with the
    corresponding global field position.  If field_position_local is
    specified, deletes the field with the corresponding local field
    position and tag.  Else deletes all the fields matching tag and
    optionally ind1 and ind2.

    If both field_position_global and field_position_local are
    present, then field_position_local takes precedence.

    @param rec: the record data structure
    @param tag: the tag of the field to be deleted
    @param ind1: the first indicator of the field to be deleted
    @param ind2: the second indicator of the field to be deleted
    @param field_position_global: the global field position (record wise)
    @param field_position_local: the local field position (tag wise)
    @return: the list of deleted fields
    """
    error = validate_record_field_positions_global(rec)
    if error:
        # FIXME one should write a message here.
        pass

    if tag not in rec:
        return []

    ind1, ind2 = _wash_indicators(ind1, ind2)

    deleted = []
    newfields = []

    if field_position_global is None and field_position_local is None:
        # Remove all fields with tag 'tag'.
        for field in rec[tag]:
            if field[1] != ind1 or field[2] != ind2:
                newfields.append(field)
            else:
                deleted.append(field)
        rec[tag] = newfields
    elif field_position_global is not None:
        # Remove the field with 'field_position_global'.
        for field in rec[tag]:
            if (field[1] != ind1 and field[2] != ind2
                or field[4] != field_position_global):
                newfields.append(field)
            else:
                deleted.append(field)
        rec[tag] = newfields
    elif field_position_local is not None:
        # Remove the field with 'field_position_local'.
        try:
            del rec[tag][field_position_local]
        except IndexError:
            return []

    if not rec[tag]:
        # Tag is now empty, remove it.
        del rec[tag]

    return deleted

def record_delete_fields(rec, tag, field_positions_local=None):
    """
    Delete all/some fields defined with MARC tag 'tag' from record 'rec'.

    @param rec: a record structure.
    @type rec: dict
    @param tag: three letter field.
    @type tag: string
    @param field_positions_local: if set, it is the list of local
        positions within all the fields with the specified tag that
        should be deleted.  If not set, all the fields with the
        specified tag will be deleted.
    @type field_positions_local: sequence
    @return: the list of deleted fields.
    @rtype: list
    @note: the record is modified in place.
    """
    if tag not in rec:
        return []

    new_fields, deleted_fields = [], []

    for position, field in enumerate(rec.get(tag, [])):
        if field_positions_local is None or position in field_positions_local:
            deleted_fields.append(field)
        else:
            new_fields.append(field)

    if new_fields:
        rec[tag] = new_fields
    else:
        del rec[tag]

    return deleted_fields

def record_add_fields(rec, tag, fields, field_position_local=None,
                      field_position_global=None):
    """
    Adds the fields into the record at the required position.  The
    position is specified by the tag and the field_position_local in
    the list of fields.

    @param rec: a record structure
    @param tag: the tag of the fields to be added
    @param field_position_local: the field_position_local at which the
        field will be inserted.  If not specified, appends the fields
        to the tag.
@param a: list of fields to be added @return: -1 if the operation failed, or the field_position_local if it was successful """ if field_position_local is None and field_position_global is None: for field in fields: record_add_field(rec, tag, ind1=field[1], ind2=field[2], subfields=field[0], controlfield_value=field[3]) else: fields.reverse() for field in fields: record_add_field(rec, tag, ind1=field[1], ind2=field[2], subfields=field[0], controlfield_value=field[3], field_position_local=field_position_local, field_position_global=field_position_global) return field_position_local def record_move_fields(rec, tag, field_positions_local, field_position_local=None): """ Moves some fields to the position specified by 'field_position_local'. @param rec: a record structure as returned by create_record() @param tag: the tag of the fields to be moved @param field_positions_local: the positions of the fields to move @param field_position_local: insert the field before that field_position_local. If unspecified, appends the fields @return: the field_position_local is the operation was successful """ fields = record_delete_fields(rec, tag, field_positions_local=field_positions_local) return record_add_fields(rec, tag, fields, field_position_local=field_position_local) def record_delete_subfield(rec, tag, subfield_code, ind1=' ', ind2=' '): """Deletes all subfields with subfield_code in the record.""" ind1, ind2 = _wash_indicators(ind1, ind2) for field in rec.get(tag, []): if field[1] == ind1 and field[2] == ind2: field[0][:] = [subfield for subfield in field[0] if subfield_code != subfield[0]] def record_get_field(rec, tag, field_position_global=None, field_position_local=None): """ Returns the the matching field. One has to enter either a global field position or a local field position. @return: a list of subfield tuples (subfield code, value). @rtype: list """ if field_position_global is None and field_position_local is None: raise InvenioBibRecordFieldError("A field position is required to " "complete this operation.") elif field_position_global is not None and field_position_local is not None: raise InvenioBibRecordFieldError("Only one field position is required " "to complete this operation.") elif field_position_global: if not tag in rec: raise InvenioBibRecordFieldError("No tag '%s' in record." % tag) for field in rec[tag]: if field[4] == field_position_global: return field raise InvenioBibRecordFieldError("No field has the tag '%s' and the " "global field position '%d'." % (tag, field_position_global)) else: try: return rec[tag][field_position_local] except KeyError: raise InvenioBibRecordFieldError("No tag '%s' in record." % tag) except IndexError: raise InvenioBibRecordFieldError("No field has the tag '%s' and " "the local field position '%d'." % (tag, field_position_local)) def record_replace_field(rec, tag, new_field, field_position_global=None, field_position_local=None): """Replaces a field with a new field.""" if field_position_global is None and field_position_local is None: raise InvenioBibRecordFieldError("A field position is required to " "complete this operation.") elif field_position_global is not None and field_position_local is not None: raise InvenioBibRecordFieldError("Only one field position is required " "to complete this operation.") elif field_position_global: if not tag in rec: raise InvenioBibRecordFieldError("No tag '%s' in record." 
% tag) replaced = False for position, field in enumerate(rec[tag]): if field[4] == field_position_global: rec[tag][position] = new_field replaced = True if not replaced: raise InvenioBibRecordFieldError("No field has the tag '%s' and " "the global field position '%d'." % (tag, field_position_global)) else: try: rec[tag][field_position_local] = new_field except KeyError: raise InvenioBibRecordFieldError("No tag '%s' in record." % tag) except IndexError: raise InvenioBibRecordFieldError("No field has the tag '%s' and " "the local field position '%d'." % (tag, field_position_local)) def record_get_subfields(rec, tag, field_position_global=None, field_position_local=None): """ Returns the subfield of the matching field. One has to enter either a global field position or a local field position. @return: a list of subfield tuples (subfield code, value). @rtype: list """ field = record_get_field(rec, tag, field_position_global=field_position_global, field_position_local=field_position_local) return field[0] def record_delete_subfield_from(rec, tag, subfield_position, field_position_global=None, field_position_local=None): """Delete subfield from position specified by tag, field number and subfield position.""" subfields = record_get_subfields(rec, tag, field_position_global=field_position_global, field_position_local=field_position_local) try: del subfields[subfield_position] except IndexError: from invenio.xmlmarc2textmarc import create_marc_record recordMarc = create_marc_record(rec, 0, {"text-marc": 1, "aleph-marc": 0}) raise InvenioBibRecordFieldError("The record : %(recordCode)s does not contain the subfield " "'%(subfieldIndex)s' inside the field (local: '%(fieldIndexLocal)s, global: '%(fieldIndexGlobal)s' ) of tag '%(tag)s'." % \ {"subfieldIndex" : subfield_position, \ "fieldIndexLocal" : str(field_position_local), \ "fieldIndexGlobal" : str(field_position_global), \ "tag" : tag, \ "recordCode" : recordMarc}) if not subfields: if field_position_global is not None: for position, field in enumerate(rec[tag]): if field[4] == field_position_global: del rec[tag][position] else: del rec[tag][field_position_local] if not rec[tag]: del rec[tag] def record_add_subfield_into(rec, tag, subfield_code, value, subfield_position=None, field_position_global=None, field_position_local=None): """Add subfield into position specified by tag, field number and optionally by subfield position.""" subfields = record_get_subfields(rec, tag, field_position_global=field_position_global, field_position_local=field_position_local) if subfield_position is None: subfields.append((subfield_code, value)) else: subfields.insert(subfield_position, (subfield_code, value)) def record_modify_controlfield(rec, tag, controlfield_value, field_position_global=None, field_position_local=None): """Modify controlfield at position specified by tag and field number.""" field = record_get_field(rec, tag, field_position_global=field_position_global, field_position_local=field_position_local) new_field = (field[0], field[1], field[2], controlfield_value, field[4]) record_replace_field(rec, tag, new_field, field_position_global=field_position_global, field_position_local=field_position_local) def record_modify_subfield(rec, tag, subfield_code, value, subfield_position, field_position_global=None, field_position_local=None): """Modify subfield at position specified by tag, field number and subfield position.""" subfields = record_get_subfields(rec, tag, field_position_global=field_position_global, field_position_local=field_position_local) 
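    # Subfield positions are 0-based; the (code, value) pair is replaced
    # in place within the field's subfield list.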
try: subfields[subfield_position] = (subfield_code, value) except IndexError: raise InvenioBibRecordFieldError("There is no subfield with position " "'%d'." % subfield_position) def record_move_subfield(rec, tag, subfield_position, new_subfield_position, field_position_global=None, field_position_local=None): """Move subfield at position specified by tag, field number and subfield position to new subfield position.""" subfields = record_get_subfields(rec, tag, field_position_global=field_position_global, field_position_local=field_position_local) try: subfield = subfields.pop(subfield_position) subfields.insert(new_subfield_position, subfield) except IndexError: raise InvenioBibRecordFieldError("There is no subfield with position " "'%d'." % subfield_position) def record_get_field_value(rec, tag, ind1=" ", ind2=" ", code=""): """Returns first (string) value that matches specified field (tag, ind1, ind2, code) of the record (rec). Returns empty string if not found. Parameters (tag, ind1, ind2, code) can contain wildcard %. Difference between wildcard % and empty '': - Empty char specifies that we are not interested in a field which has one of the indicator(s)/subfield specified. - Wildcard specifies that we are interested in getting the value of the field whatever the indicator(s)/subfield is. For e.g. consider the following record in MARC: 100C5 $$a val1 555AB $$a val2 555AB val3 555 $$a val4 555A val5 >> record_get_field_value(record, '555', 'A', '', '') >> "val5" >> record_get_field_value(record, '555', 'A', '%', '') >> "val3" >> record_get_field_value(record, '555', 'A', '%', '%') >> "val2" >> record_get_field_value(record, '555', 'A', 'B', '') >> "val3" >> record_get_field_value(record, '555', '', 'B', 'a') >> "" >> record_get_field_value(record, '555', '', '', 'a') >> "val4" >> record_get_field_value(record, '555', '', '', '') >> "" >> record_get_field_value(record, '%%%', '%', '%', '%') >> "val1" @param rec: a record structure as returned by create_record() @param tag: a 3 characters long string @param ind1: a 1 character long string @param ind2: a 1 character long string @param code: a 1 character long string @return: string value (empty if nothing found)""" # Note: the code is quite redundant for speed reasons (avoid calling # functions or doing tests inside loops) ind1, ind2 = _wash_indicators(ind1, ind2) if '%' in tag: # Wild card in tag. Must find all corresponding fields if code == '': # Code not specified. for field_tag, fields in rec.items(): if _tag_matches_pattern(field_tag, tag): for field in fields: if ind1 in ('%', field[1]) and ind2 in ('%', field[2]): # Return matching field value if not empty if field[3]: return field[3] elif code == '%': # Code is wildcard. Take first subfield of first matching field for field_tag, fields in rec.items(): if _tag_matches_pattern(field_tag, tag): for field in fields: if (ind1 in ('%', field[1]) and ind2 in ('%', field[2]) and field[0]): return field[0][0][1] else: # Code is specified. Take corresponding one for field_tag, fields in rec.items(): if _tag_matches_pattern(field_tag, tag): for field in fields: if ind1 in ('%', field[1]) and ind2 in ('%', field[2]): for subfield in field[0]: if subfield[0] == code: return subfield[1] else: # Tag is completely specified. Use tag as dict key if tag in rec: if code == '': # Code not specified. for field in rec[tag]: if ind1 in ('%', field[1]) and ind2 in ('%', field[2]): # Return matching field value if not empty # or return "" empty if not exist. 
if field[3]: return field[3] elif code == '%': # Code is wildcard. Take first subfield of first matching field for field in rec[tag]: if (ind1 in ('%', field[1]) and ind2 in ('%', field[2]) and field[0]): return field[0][0][1] else: # Code is specified. Take corresponding one for field in rec[tag]: if ind1 in ('%', field[1]) and ind2 in ('%', field[2]): for subfield in field[0]: if subfield[0] == code: return subfield[1] # Nothing was found return "" def record_get_field_values(rec, tag, ind1=" ", ind2=" ", code=""): """Returns the list of (string) values for the specified field (tag, ind1, ind2, code) of the record (rec). Returns empty list if not found. Parameters (tag, ind1, ind2, code) can contain wildcard %. @param rec: a record structure as returned by create_record() @param tag: a 3 characters long string @param ind1: a 1 character long string @param ind2: a 1 character long string @param code: a 1 character long string @return: a list of strings""" tmp = [] ind1, ind2 = _wash_indicators(ind1, ind2) if '%' in tag: # Wild card in tag. Must find all corresponding tags and fields tags = [k for k in rec if _tag_matches_pattern(k, tag)] if code == '': # Code not specified. Consider field value (without subfields) for tag in tags: for field in rec[tag]: if (ind1 in ('%', field[1]) and ind2 in ('%', field[2]) and field[3]): tmp.append(field[3]) elif code == '%': # Code is wildcard. Consider all subfields for tag in tags: for field in rec[tag]: if ind1 in ('%', field[1]) and ind2 in ('%', field[2]): for subfield in field[0]: tmp.append(subfield[1]) else: # Code is specified. Consider all corresponding subfields for tag in tags: for field in rec[tag]: if ind1 in ('%', field[1]) and ind2 in ('%', field[2]): for subfield in field[0]: if subfield[0] == code: tmp.append(subfield[1]) else: # Tag is completely specified. Use tag as dict key if rec and tag in rec: if code == '': # Code not specified. Consider field value (without subfields) for field in rec[tag]: if (ind1 in ('%', field[1]) and ind2 in ('%', field[2]) and field[3]): tmp.append(field[3]) elif code == '%': # Code is wildcard. Consider all subfields for field in rec[tag]: if ind1 in ('%', field[1]) and ind2 in ('%', field[2]): for subfield in field[0]: tmp.append(subfield[1]) else: # Code is specified. Take corresponding one for field in rec[tag]: if ind1 in ('%', field[1]) and ind2 in ('%', field[2]): for subfield in field[0]: if subfield[0] == code: tmp.append(subfield[1]) # If tmp was not set, nothing was found return tmp def record_xml_output(rec, tags=None): """Generates the XML for record 'rec' and returns it as a string @rec: record @tags: list of tags to be printed""" if tags is None: tags = [] if isinstance(tags, str): tags = [tags] if tags and '001' not in tags: # Add the missing controlfield. 
        tags.append('001')

    marcxml = ['<record>']
    # Add the tag 'tag' to each field in rec[tag]
    fields = []
    for tag in rec:
        if not tags or tag in tags:
            for field in rec[tag]:
                fields.append((tag, field))
    record_order_fields(fields)
    for field in fields:
        marcxml.append(field_xml_output(field[1], field[0]))
    marcxml.append('</record>')
    return '\n'.join(marcxml)

def field_get_subfield_instances(field):
    """Returns the list of subfields associated with field 'field'."""
    return field[0]

def field_get_subfield_values(field_instance, code):
    """Returns subfield CODE values of the field instance FIELD_INSTANCE."""
    return [subfield_value
            for subfield_code, subfield_value in field_instance[0]
            if subfield_code == code]

def field_add_subfield(field, code, value):
    """Adds a subfield to field 'field'."""
    field[0].append((code, value))

def record_order_fields(rec, fun="_order_by_ord"):
    """Orders the (tag, field) list 'rec' using the comparison function
    named by 'fun'."""
    rec.sort(eval(fun))

def field_xml_output(field, tag):
    """Generates the XML for field 'field' and returns it as a string."""
    marcxml = []
    if field[3]:
        marcxml.append('  <controlfield tag="%s">%s</controlfield>' %
            (tag, encode_for_xml(field[3])))
    else:
        marcxml.append('  <datafield tag="%s" ind1="%s" ind2="%s">' %
            (tag, field[1], field[2]))
        marcxml += [_subfield_xml_output(subfield) for subfield in field[0]]
        marcxml.append('  </datafield>')
    return '\n'.join(marcxml)

def record_extract_oai_id(record):
    """Returns the OAI ID of the record."""
    tag = CFG_BIBUPLOAD_EXTERNAL_OAIID_TAG[0:3]
    ind1 = CFG_BIBUPLOAD_EXTERNAL_OAIID_TAG[3]
    ind2 = CFG_BIBUPLOAD_EXTERNAL_OAIID_TAG[4]
    subfield = CFG_BIBUPLOAD_EXTERNAL_OAIID_TAG[5]
    values = record_get_field_values(record, tag, ind1, ind2, subfield)
    oai_id_regex = re.compile("oai[a-zA-Z0-9/.:]+")
    for value in [value.strip() for value in values]:
        if oai_id_regex.match(value):
            return value
    return ""

def print_rec(rec, format=1, tags=None):
    """Prints a record.

    format = 1 -- XML
    format = 2 -- HTML (not implemented)

    @tags: list of tags to be printed
    """
    if tags is None:
        tags = []
    if format == 1:
        text = record_xml_output(rec, tags)
    else:
        return ''
    return text

def print_recs(listofrec, format=1, tags=None):
    """Prints a list of records.

    format = 1 -- XML
    format = 2 -- HTML (not implemented)

    @tags: list of tags to be printed

    If 'listofrec' is not a list, an empty string is returned.
    """
    if tags is None:
        tags = []
    text = ""
    if type(listofrec).__name__ != 'list':
        return ""
    else:
        for rec in listofrec:
            text = "%s\n%s" % (text, print_rec(rec, format, tags))
    return text

def concat(alist):
    """Concatenates a list of lists."""
    newl = []
    for l in alist:
        newl.extend(l)
    return newl

def print_errors(alist):
    """Creates a unique string from the strings in 'alist', using '\n' as a
    separator."""
    text = ""
    for l in alist:
        text = '%s\n%s' % (text, l)
    return text

def record_find_field(rec, tag, field, strict=False):
    """
    Returns the global and local positions of the first occurrence of the
    field in a record.

    @param rec: A record dictionary structure
    @type rec: dictionary
    @param tag: The tag of the field to search for
    @type tag: string
    @param field: A field tuple as returned by create_field()
    @type field: tuple
    @param strict: A boolean describing the search method.  If strict is
        False, then the order of the subfields doesn't matter.  The default
        search method is strict.
    @type strict: boolean
    @return: A tuple (global_position, local_position), or (None, None) if
        the field is not present.
    @rtype: tuple
    @raise InvenioBibRecordFieldError: If the provided field is invalid.
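
    Illustrative usage (a sketch; it assumes a record 'rec' whose '100'
    field has global position 1 and whose first subfield is
    ('a', 'Ellis, J.')):

    >>> field = ([('a', 'Ellis, J.')], ' ', ' ', '', 1)
    >>> record_find_field(rec, '100', field, strict=False)
    (1, 0)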
""" try: _check_field_validity(field) except InvenioBibRecordFieldError: raise for local_position, field1 in enumerate(rec.get(tag, [])): if _compare_fields(field, field1, strict): return (field1[4], local_position) return (None, None) def record_strip_empty_volatile_subfields(rec): """ Removes unchanged volatile subfields from the record """ for tag in rec.keys(): for field in rec[tag]: field[0][:] = [subfield for subfield in field[0] if subfield[1][:9] != "VOLATILE:"] def record_strip_empty_fields(rec, tag=None): """ Removes empty subfields and fields from the record. If 'tag' is not None, only a specific tag of the record will be stripped, otherwise the whole record. @param rec: A record dictionary structure @type rec: dictionary @param tag: The tag of the field to strip empty fields from @type tag: string """ # Check whole record if tag is None: tags = rec.keys() for tag in tags: record_strip_empty_fields(rec, tag) # Check specific tag of the record elif tag in rec: # in case of a controlfield if tag[:2] == '00': if len(rec[tag]) == 0 or not rec[tag][0][3]: del rec[tag] #in case of a normal field else: fields = [] for field in rec[tag]: subfields = [] for subfield in field[0]: # check if the subfield has been given a value if subfield[1]: subfields.append(subfield) if len(subfields) > 0: new_field = create_field(subfields, field[1], field[2], field[3]) fields.append(new_field) if len(fields) > 0: rec[tag] = fields else: del rec[tag] ### IMPLEMENTATION / INVISIBLE FUNCTIONS def _compare_fields(field1, field2, strict=True): """ Compares 2 fields. If strict is True, then the order of the subfield will be taken care of, if not then the order of the subfields doesn't matter. @return: True if the field are equivalent, False otherwise. """ if strict: # Return a simple equal test on the field minus the position. return field1[:4] == field2[:4] else: if field1[1:4] != field2[1:4]: # Different indicators or controlfield value. return False else: # Compare subfields in a loose way. return set(field1[0]) == set(field2[0]) def _check_field_validity(field): """ Checks if a field is well-formed. @param field: A field tuple as returned by create_field() @type field: tuple @raise InvenioBibRecordFieldError: If the field is invalid. """ if type(field) not in (list, tuple): raise InvenioBibRecordFieldError("Field of type '%s' should be either " "a list or a tuple." % type(field)) if len(field) != 5: raise InvenioBibRecordFieldError("Field of length '%d' should have 5 " "elements." % len(field)) if type(field[0]) not in (list, tuple): raise InvenioBibRecordFieldError("Subfields of type '%s' should be " "either a list or a tuple." % type(field[0])) if type(field[1]) is not str: raise InvenioBibRecordFieldError("Indicator 1 of type '%s' should be " "a string." % type(field[1])) if type(field[2]) is not str: raise InvenioBibRecordFieldError("Indicator 2 of type '%s' should be " "a string." % type(field[2])) if type(field[3]) is not str: raise InvenioBibRecordFieldError("Controlfield value of type '%s' " "should be a string." % type(field[3])) if type(field[4]) is not int: raise InvenioBibRecordFieldError("Global position of type '%s' should " "be an int." % type(field[4])) for subfield in field[0]: if (type(subfield) not in (list, tuple) or len(subfield) != 2 or type(subfield[0]) is not str or type(subfield[1]) is not str): raise InvenioBibRecordFieldError("Subfields are malformed. 
" "Should a list of tuples of 2 strings.") def _shift_field_positions_global(record, start, delta=1): """Shifts all global field positions with global field positions higher or equal to 'start' from the value 'delta'.""" if not delta: return for tag, fields in record.items(): newfields = [] for field in fields: if field[4] < start: newfields.append(field) else: # Increment the global field position by delta. newfields.append(tuple(list(field[:4]) + [field[4] + delta])) record[tag] = newfields def _tag_matches_pattern(tag, pattern): """Returns true if MARC 'tag' matches a 'pattern'. 'pattern' is plain text, with % as wildcard Both parameters must be 3 characters long strings. For e.g. >> _tag_matches_pattern("909", "909") -> True >> _tag_matches_pattern("909", "9%9") -> True >> _tag_matches_pattern("909", "9%8") -> False @param tag: a 3 characters long string @param pattern: a 3 characters long string @return: False or True""" for char1, char2 in zip(tag, pattern): if char2 not in ('%', char1): return False return True def validate_record_field_positions_global(record): """ Checks if the global field positions in the record are valid ie no duplicate global field positions and local field positions in the list of fields are ascending. @param record: the record data structure @return: the first error found as a string or None if no error was found """ all_fields = [] for tag, fields in record.items(): previous_field_position_global = -1 for field in fields: if field[4] < previous_field_position_global: return "Non ascending global field positions in tag '%s'." % tag previous_field_position_global = field[4] if field[4] in all_fields: return ("Duplicate global field position '%d' in tag '%s'" % (field[4], tag)) def _record_sort_by_indicators(record): """Sorts the fields inside the record by indicators.""" for tag, fields in record.items(): record[tag] = _fields_sort_by_indicators(fields) def _fields_sort_by_indicators(fields): """Sorts a set of fields by their indicators. Returns a sorted list with correct global field positions.""" field_dict = {} field_positions_global = [] for field in fields: field_dict.setdefault(field[1:3], []).append(field) field_positions_global.append(field[4]) indicators = field_dict.keys() indicators.sort() field_list = [] for indicator in indicators: for field in field_dict[indicator]: field_list.append(field[:4] + (field_positions_global.pop(0),)) return field_list def _select_parser(parser=None): """Selects the more relevant parser based on the parsers available and on the parser desired by the user.""" if not AVAILABLE_PARSERS: # No parser is available. This is bad. return None if parser is None or parser not in AVAILABLE_PARSERS: # Return the best available parser. return AVAILABLE_PARSERS[0] else: return parser def _create_record_rxp(marcxml, verbose=CFG_BIBRECORD_DEFAULT_VERBOSE_LEVEL, correct=CFG_BIBRECORD_DEFAULT_CORRECT, keep_singletons=CFG_BIBRECORD_KEEP_SINGLETONS): """Creates a record object using the RXP parser. If verbose>3 then the parser will be strict and will stop in case of well-formedness errors or DTD errors. If verbose=0, the parser will not give warnings. If 0 < verbose <= 3, the parser will not give errors, but will warn the user about possible mistakes correct != 0 -> We will try to correct errors such as missing attributes correct = 0 -> there will not be any attempt to correct errors""" if correct: # Note that with pyRXP < 1.13 a memory leak has been found # involving DTD parsing. 
So enable correction only if you have # pyRXP 1.13 or greater. marcxml = ('\n' '\n' '\n%s\n' % (CFG_MARC21_DTD, marcxml)) # Create the pyRXP parser. pyrxp_parser = pyRXP.Parser(ErrorOnValidityErrors=0, ProcessDTD=1, ErrorOnUnquotedAttributeValues=0, srcName='string input') if verbose > 3: pyrxp_parser.ErrorOnValidityErrors = 1 pyrxp_parser.ErrorOnUnquotedAttributeValues = 1 try: root = pyrxp_parser.parse(marcxml) except pyRXP.error, ex1: raise InvenioBibRecordParserError(str(ex1)) # If record is enclosed in a collection tag, extract it. if root[TAG] == 'collection': children = _get_children_by_tag_name_rxp(root, 'record') if not children: return {} root = children[0] record = {} # This is needed because of the record_xml_output function, where we # need to know the order of the fields. field_position_global = 1 # Consider the control fields. for controlfield in _get_children_by_tag_name_rxp(root, 'controlfield'): if controlfield[CHILDREN]: value = ''.join([n for n in controlfield[CHILDREN]]) # Construct the field tuple. field = ([], ' ', ' ', value, field_position_global) record.setdefault(controlfield[ATTRS]['tag'], []).append(field) field_position_global += 1 elif keep_singletons: field = ([], ' ', ' ', '', field_position_global) record.setdefault(controlfield[ATTRS]['tag'], []).append(field) field_position_global += 1 # Consider the data fields. for datafield in _get_children_by_tag_name_rxp(root, 'datafield'): subfields = [] for subfield in _get_children_by_tag_name_rxp(datafield, 'subfield'): if subfield[CHILDREN]: value = ''.join([n for n in subfield[CHILDREN]]) subfields.append((subfield[ATTRS].get('code', '!'), value)) elif keep_singletons: subfields.append((subfield[ATTRS].get('code', '!'), '')) if subfields or keep_singletons: # Create the field. tag = datafield[ATTRS].get('tag', '!') ind1 = datafield[ATTRS].get('ind1', '!') ind2 = datafield[ATTRS].get('ind2', '!') ind1, ind2 = _wash_indicators(ind1, ind2) # Construct the field tuple. field = (subfields, ind1, ind2, '', field_position_global) record.setdefault(tag, []).append(field) field_position_global += 1 return record def _create_record_from_document(document, keep_singletons=CFG_BIBRECORD_KEEP_SINGLETONS): """Creates a record from the document (of type xml.dom.minidom.Document or Ft.Xml.Domlette.Document).""" root = None for node in document.childNodes: if node.nodeType == node.ELEMENT_NODE: root = node break if root is None: return {} if root.tagName == 'collection': children = _get_children_by_tag_name(root, 'record') if not children: return {} root = children[0] field_position_global = 1 record = {} for controlfield in _get_children_by_tag_name(root, "controlfield"): tag = controlfield.getAttributeNS(None, "tag").encode('utf-8') text_nodes = controlfield.childNodes value = ''.join([n.data for n in text_nodes]).encode("utf-8") if value or keep_singletons: field = ([], " ", " ", value, field_position_global) record.setdefault(tag, []).append(field) field_position_global += 1 for datafield in _get_children_by_tag_name(root, "datafield"): subfields = [] for subfield in _get_children_by_tag_name(datafield, "subfield"): text_nodes = subfield.childNodes value = ''.join([n.data for n in text_nodes]).encode("utf-8") if value or keep_singletons: code = subfield.getAttributeNS(None, 'code').encode("utf-8") subfields.append((code or '!', value)) if subfields or keep_singletons: tag = datafield.getAttributeNS(None, "tag").encode("utf-8") or '!' 
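            # ('!' stands in for a missing or empty attribute; such
            # placeholder tags, codes and indicators are reported and
            # replaced later by _correct_record().)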
            ind1 = datafield.getAttributeNS(None, "ind1").encode("utf-8")
            ind2 = datafield.getAttributeNS(None, "ind2").encode("utf-8")
            ind1, ind2 = _wash_indicators(ind1, ind2)
            field = (subfields, ind1, ind2, "", field_position_global)
            record.setdefault(tag, []).append(field)
            field_position_global += 1
    return record

def _create_record_minidom(marcxml,
        keep_singletons=CFG_BIBRECORD_KEEP_SINGLETONS):
    """Creates a record using minidom."""
    try:
        dom = xml.dom.minidom.parseString(marcxml)
    except xml.parsers.expat.ExpatError, ex1:
        raise InvenioBibRecordParserError(str(ex1))
    return _create_record_from_document(dom, keep_singletons=keep_singletons)

def _create_record_4suite(marcxml,
        keep_singletons=CFG_BIBRECORD_KEEP_SINGLETONS):
    """Creates a record using the 4suite parser."""
    try:
        dom = Ft.Xml.Domlette.NonvalidatingReader.parseString(marcxml,
            "urn:dummy")
    except Ft.Xml.ReaderException, ex1:
        raise InvenioBibRecordParserError(ex1.message)
    return _create_record_from_document(dom, keep_singletons=keep_singletons)

def _concat(alist):
    """Concatenates a list of lists."""
    return [element for single_list in alist for element in single_list]

def _subfield_xml_output(subfield):
    """Generates the XML for a subfield object and returns it as a string."""
    return '    <subfield code="%s">%s</subfield>' % \
        (subfield[0], encode_for_xml(subfield[1]))

def _order_by_ord(field1, field2):
    """Function used to order the fields according to their global position
    (ord) value."""
    return cmp(field1[1][4], field2[1][4])

def _get_children_by_tag_name(node, name):
    """Retrieves all children from node 'node' with name 'name' and returns
    them as a list."""
    try:
        return [child for child in node.childNodes if child.nodeName == name]
    except TypeError:
        return []

def _get_children_by_tag_name_rxp(node, name):
    """Retrieves all children of the RXP node 'node' whose tag name equals
    'name' and returns them as a list.  'node' is a list as returned by the
    RXP parser."""
    try:
        return [child for child in node[CHILDREN] if child[TAG] == name]
    except TypeError:
        return []

def _wash_indicators(*indicators):
    """
    Washes the values of the indicators.  An empty string or an underscore
    is replaced by a blank space.

    @param indicators: a series of indicators to be washed
    @return: a list of washed indicators
    """
    return [indicator in ('', '_') and ' ' or indicator
            for indicator in indicators]

def _correct_record(record):
    """
    Checks and corrects the structure of the record.

    @param record: the record data structure
    @return: a list of errors found
    """
    errors = []
    for tag in record.keys():
        upper_bound = '999'
        n = len(tag)
        if n > 3:
            i = n - 3
            while i > 0:
                upper_bound = '%s%s' % ('0', upper_bound)
                i -= 1
        # Missing tag.  Replace it with dummy tag '000'.
        if tag == '!':
            errors.append((1, '(field number(s): ' +
                str([f[4] for f in record[tag]]) + ')'))
            record['000'] = record.pop(tag)
            tag = '000'
        elif not ('001' <= tag <= upper_bound or tag in ('FMT', 'FFT')):
            errors.append(2)
            record['000'] = record.pop(tag)
            tag = '000'
        fields = []
        for field in record[tag]:
            # Datafield without any subfield.
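            # (The numeric codes collected in 'errors' map to human-readable
            # messages via CFG_BIBRECORD_WARNING_MSGS in _warning() below.)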
if field[0] == [] and field[3] == '': errors.append((8, '(field number: ' + str(field[4]) + ')')) subfields = [] for subfield in field[0]: if subfield[0] == '!': errors.append((3, '(field number: ' + str(field[4]) + ')')) newsub = ('', subfield[1]) else: newsub = subfield subfields.append(newsub) if field[1] == '!': errors.append((4, '(field number: ' + str(field[4]) + ')')) ind1 = " " else: ind1 = field[1] if field[2] == '!': errors.append((5, '(field number: ' + str(field[4]) + ')')) ind2 = " " else: ind2 = field[2] fields.append((subfields, ind1, ind2, field[3], field[4])) record[tag] = fields return errors def _warning(code): """It returns a warning message of code 'code'. If code = (cd, str) it returns the warning message of code 'cd' and appends str at the end""" if isinstance(code, str): return code message = '' if isinstance(code, tuple): if isinstance(code[0], str): message = code[1] code = code[0] return CFG_BIBRECORD_WARNING_MSGS.get(code, '') + message def _warnings(alist): """Applies the function _warning() to every element in l.""" return [_warning(element) for element in alist] def _compare_lists(list1, list2, custom_cmp): """Compares twolists using given comparing function @param list1: first list to compare @param list2: second list to compare @param custom_cmp: a function taking two arguments (element of list 1, element of list 2) and @return: True or False depending if the values are the same""" if len(list1) != len(list2): return False for element1, element2 in zip(list1, list2): if not custom_cmp(element1, element2): return False return True - -if PSYCO_AVAILABLE: - psyco.bind(_correct_record) - psyco.bind(_create_record_4suite) - psyco.bind(_create_record_rxp) - psyco.bind(_create_record_minidom) - psyco.bind(field_get_subfield_values) - psyco.bind(create_records) - psyco.bind(create_record) - psyco.bind(record_get_field_instances) - psyco.bind(record_get_field_value) - psyco.bind(record_get_field_values) diff --git a/modules/bibindex/lib/bibindex_engine.py b/modules/bibindex/lib/bibindex_engine.py index 26941b81e..45975f736 100644 --- a/modules/bibindex/lib/bibindex_engine.py +++ b/modules/bibindex/lib/bibindex_engine.py @@ -1,1732 +1,1723 @@ # -*- coding: utf-8 -*- ## ## This file is part of Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010 CERN. ## ## Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. """ BibIndex indexing engine implementation. See bibindex executable for entry point. 
""" __revision__ = "$Id$" import os import re import sys import time from invenio.config import \ CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS, \ CFG_BIBINDEX_CHARS_PUNCTUATION, \ CFG_BIBINDEX_FULLTEXT_INDEX_LOCAL_FILES_ONLY, \ CFG_BIBINDEX_MIN_WORD_LENGTH, \ CFG_BIBINDEX_REMOVE_HTML_MARKUP, \ CFG_BIBINDEX_REMOVE_LATEX_MARKUP, \ CFG_BIBINDEX_AUTHOR_WORD_INDEX_EXCLUDE_FIRST_NAMES, \ CFG_CERN_SITE, CFG_INSPIRE_SITE, \ CFG_BIBINDEX_PERFORM_OCR_ON_DOCNAMES, \ CFG_BIBINDEX_SPLASH_PAGES from invenio.websubmit_config import CFG_WEBSUBMIT_BEST_FORMATS_TO_EXTRACT_TEXT_FROM from invenio.bibindex_engine_config import CFG_MAX_MYSQL_THREADS, \ CFG_MYSQL_THREAD_TIMEOUT, \ CFG_CHECK_MYSQL_THREADS from invenio.bibindex_engine_tokenizer import BibIndexFuzzyNameTokenizer, \ BibIndexExactNameTokenizer from invenio.bibdocfile import bibdocfile_url_p, \ bibdocfile_url_to_bibdoc, normalize_format, \ download_url, guess_format_from_url, BibRecDocs from invenio.websubmit_file_converter import convert_file from invenio.search_engine import perform_request_search, strip_accents, \ wash_index_term, lower_index_term, get_index_stemming_language from invenio.dbquery import run_sql, DatabaseError, serialize_via_marshal, \ deserialize_via_marshal from invenio.bibindex_engine_stopwords import is_stopword from invenio.bibindex_engine_stemmer import stem from invenio.bibtask import task_init, write_message, get_datetime, \ task_set_option, task_get_option, task_get_task_param, task_update_status, \ task_update_progress, task_sleep_now_if_required from invenio.intbitset import intbitset from invenio.errorlib import register_exception from invenio.htmlutils import remove_html_markup from invenio.textutils import wash_for_utf8 if sys.hexversion < 0x2040000: # pylint: disable=W0622 from sets import Set as set # pylint: enable=W0622 # FIXME: journal tag and journal pubinfo standard format are defined here: if CFG_CERN_SITE: CFG_JOURNAL_TAG = '773__%' CFG_JOURNAL_PUBINFO_STANDARD_FORM = "773__p 773__v (773__y) 773__c" elif CFG_INSPIRE_SITE: CFG_JOURNAL_TAG = '773__%' CFG_JOURNAL_PUBINFO_STANDARD_FORM = "773__p,773__v,773__c" else: CFG_JOURNAL_TAG = '909C4%' CFG_JOURNAL_PUBINFO_STANDARD_FORM = "909C4p 909C4v (909C4y) 909C4c" ## precompile some often-used regexp for speed reasons: re_subfields = re.compile('\$\$\w') re_block_punctuation_begin = re.compile(r"^"+CFG_BIBINDEX_CHARS_PUNCTUATION+"+") re_block_punctuation_end = re.compile(CFG_BIBINDEX_CHARS_PUNCTUATION+"+$") re_punctuation = re.compile(CFG_BIBINDEX_CHARS_PUNCTUATION) re_separators = re.compile(CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS) re_datetime_shift = re.compile("([-\+]{0,1})([\d]+)([dhms])") re_arxiv = re.compile(r'^arxiv:\d\d\d\d\.\d\d\d\d') nb_char_in_line = 50 # for verbose pretty printing chunksize = 1000 # default size of chunks that the records will be treated by base_process_size = 4500 # process base size _last_word_table = None def list_union(list1, list2): "Returns union of the two lists." union_dict = {} for e in list1: union_dict[e] = 1 for e in list2: union_dict[e] = 1 return union_dict.keys() ## safety function for killing slow DB threads: def kill_sleepy_mysql_threads(max_threads=CFG_MAX_MYSQL_THREADS, thread_timeout=CFG_MYSQL_THREAD_TIMEOUT): """Check the number of DB threads and if there are more than MAX_THREADS of them, lill all threads that are in a sleeping state for more than THREAD_TIMEOUT seconds. (This is useful for working around the the max_connection problem that appears during indexation in some not-yet-understood cases.) 
If some threads are to be killed, write info into the log file. """ res = run_sql("SHOW FULL PROCESSLIST") if len(res) > max_threads: for row in res: r_id, dummy, dummy, dummy, r_command, r_time, dummy, dummy = row if r_command == "Sleep" and int(r_time) > thread_timeout: run_sql("KILL %s", (r_id,)) write_message("WARNING: too many DB threads, killing thread %s" % r_id, verbose=1) return ## MARC-21 tag/field access functions def get_fieldvalues(recID, tag): """Returns list of values of the MARC-21 'tag' fields for the record 'recID'.""" bibXXx = "bib" + tag[0] + tag[1] + "x" bibrec_bibXXx = "bibrec_" + bibXXx query = "SELECT value FROM %s AS b, %s AS bb WHERE bb.id_bibrec=%%s AND bb.id_bibxxx=b.id AND tag LIKE %%s" \ % (bibXXx, bibrec_bibXXx) res = run_sql(query, (recID, tag)) return [row[0] for row in res] def get_associated_subfield_value(recID, tag, value, associated_subfield_code): """Return list of ASSOCIATED_SUBFIELD_CODE, if exists, for record RECID and TAG of value VALUE. Used by fulltext indexer only. Note: TAG must be 6 characters long (tag+ind1+ind2+sfcode), otherwise en empty string is returned. FIXME: what if many tag values have the same value but different associated_subfield_code? Better use bibrecord library for this. """ out = "" if len(tag) != 6: return out bibXXx = "bib" + tag[0] + tag[1] + "x" bibrec_bibXXx = "bibrec_" + bibXXx query = """SELECT bb.field_number, b.tag, b.value FROM %s AS b, %s AS bb WHERE bb.id_bibrec=%%s AND bb.id_bibxxx=b.id AND tag LIKE %%s%%""" % (bibXXx, bibrec_bibXXx) res = run_sql(query, (recID, tag[:-1])) field_number = -1 for row in res: if row[1] == tag and row[2] == value: field_number = row[0] if field_number > 0: for row in res: if row[0] == field_number and row[1] == tag[:-1] + associated_subfield_code: out = row[2] break return out def get_field_tags(field): """Returns a list of MARC tags for the field code 'field'. Returns empty list in case of error. Example: field='author', output=['100__%','700__%'].""" out = [] query = """SELECT t.value FROM tag AS t, field_tag AS ft, field AS f WHERE f.code=%s AND ft.id_field=f.id AND t.id=ft.id_tag ORDER BY ft.score DESC""" res = run_sql(query, (field, )) return [row[0] for row in res] ## Fulltext word extraction functions def get_fulltext_urls_from_html_page(htmlpagebody): """Parses htmlpagebody data (the splash page content) looking for url_directs referring to probable fulltexts. Returns an array of (ext,url_direct) to fulltexts. Note: it looks for file format extensions as defined by global 'CFG_WEBSUBMIT_BEST_FORMATS_TO_EXTRACT_TEXT_FROM' structure, minus the HTML ones, because we don't want to index HTML pages that the splash page might point to. """ out = [] for ext in CFG_WEBSUBMIT_BEST_FORMATS_TO_EXTRACT_TEXT_FROM: expr = re.compile( r"\"(http://[\w]+\.+[\w]+[^\"'><]*\." + \ ext + r")\"") match = expr.search(htmlpagebody) if match and ext not in ['htm', 'html']: out.append([ext, match.group(1)]) #else: # FIXME: workaround for getfile, should use bibdoc tables #expr_getfile = re.compile(r"\"(http://.*getfile\.py\?.*format=" + ext + "&version=.*)\"") #match = expr_getfile.search(htmlpagebody) #if match and ext not in ['htm', 'html']: #out.append([ext, match.group(1)]) return out def get_words_from_journal_tag(recID, tag): """ Special procedure to extract words from journal tags. Joins title/volume/year/page into a standard form that is also used for citations. 
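
    Illustrative example (assumed data, using the default non-CERN/INSPIRE
    standard form "909C4p 909C4v (909C4y) 909C4c"): subfields
    p='Phys. Lett. B', v='405', y='1997', c='123-130' are indexed
    individually and additionally yield the combined term
    'Phys. Lett. B 405 (1997) 123' (the page end is stripped from the 'c'
    value below).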
""" # get all journal tags/subfields: bibXXx = "bib" + tag[0] + tag[1] + "x" bibrec_bibXXx = "bibrec_" + bibXXx query = """SELECT bb.field_number,b.tag,b.value FROM %s AS b, %s AS bb WHERE bb.id_bibrec=%%s AND bb.id_bibxxx=b.id AND tag LIKE %%s""" % (bibXXx, bibrec_bibXXx) res = run_sql(query, (recID, tag)) # construct journal pubinfo: dpubinfos = {} for row in res: nb_instance, subfield, value = row if subfield.endswith("c"): # delete pageend if value is pagestart-pageend # FIXME: pages may not be in 'c' subfield value = value.split('-', 1)[0] if dpubinfos.has_key(nb_instance): dpubinfos[nb_instance][subfield] = value else: dpubinfos[nb_instance] = {subfield: value} # construct standard format: lwords = [] for dpubinfo in dpubinfos.values(): # index all journal subfields separately for tag,val in dpubinfo.items(): lwords.append(val) # index journal standard format: pubinfo = CFG_JOURNAL_PUBINFO_STANDARD_FORM for tag,val in dpubinfo.items(): pubinfo = pubinfo.replace(tag,val) if CFG_JOURNAL_TAG[:-1] in pubinfo: # some subfield was missing, do nothing pass else: lwords.append(pubinfo) # return list of words and pubinfos: return lwords def get_words_from_date_tag(datestring, stemming_language=None): """ Special procedure to index words from tags storing date-like information in format YYYY or YYYY-MM or YYYY-MM-DD. Namely, we are indexing word-terms YYYY, YYYY-MM, YYYY-MM-DD, but never standalone MM or DD. """ out = [] for dateword in datestring.split(): # maybe there are whitespaces, so break these too out.append(dateword) parts = dateword.split('-') for nb in range(1,len(parts)): out.append("-".join(parts[:nb])) return out def get_words_from_fulltext(url_direct_or_indirect, stemming_language=None): """Returns all the words contained in the document specified by URL_DIRECT_OR_INDIRECT with the words being split by various SRE_SEPARATORS regexp set earlier. If FORCE_FILE_EXTENSION is set (e.g. to "pdf", then treat URL_DIRECT_OR_INDIRECT as a PDF file. (This is interesting to index Indico for example.) Note also that URL_DIRECT_OR_INDIRECT may be either a direct URL to the fulltext file or an URL to a setlink-like page body that presents the links to be indexed. In the latter case the URL_DIRECT_OR_INDIRECT is parsed to extract actual direct URLs to fulltext documents, for all knows file extensions as specified by global CONV_PROGRAMS config variable. """ re_perform_ocr = re.compile(CFG_BIBINDEX_PERFORM_OCR_ON_DOCNAMES) write_message("... reading fulltext files from %s started" % url_direct_or_indirect, verbose=2) try: if bibdocfile_url_p(url_direct_or_indirect): write_message("... %s is an internal document" % url_direct_or_indirect, verbose=2) bibdoc = bibdocfile_url_to_bibdoc(url_direct_or_indirect) perform_ocr = bool(re_perform_ocr.match(bibdoc.get_docname())) write_message("... will extract words from %s (docid: %s) %s" % (bibdoc.get_docname(), bibdoc.get_id(), perform_ocr and 'with OCR' or ''), verbose=2) if not bibdoc.has_text(require_up_to_date=True): bibdoc.extract_text(perform_ocr=perform_ocr) return get_words_from_phrase(bibdoc.get_text(), stemming_language) else: if CFG_BIBINDEX_FULLTEXT_INDEX_LOCAL_FILES_ONLY: write_message("... %s is external URL but indexing only local files" % url_direct_or_indirect, verbose=2) return [] write_message("... 
%s is an external URL" % url_direct_or_indirect, verbose=2)
            best_formats = [normalize_format(format) for format in
                CFG_WEBSUBMIT_BEST_FORMATS_TO_EXTRACT_TEXT_FROM]
            format = guess_format_from_url(url_direct_or_indirect)
            if re.match(CFG_BIBINDEX_SPLASH_PAGES, url_direct_or_indirect):
                urls = get_fulltext_urls_from_html_page(url_direct_or_indirect)
            else:
                urls = [url_direct_or_indirect]
            write_message("... will extract words from %s" % ', '.join(urls),
                verbose=2)
            words = {}
            for url in urls:
                format = guess_format_from_url(url)
                tmpdoc = download_url(url, format)
                tmptext = convert_file(tmpdoc, output_format='.txt')
                os.remove(tmpdoc)
                text = open(tmptext).read()
                os.remove(tmptext)
                tmpwords = get_words_from_phrase(text, stemming_language)
                words.update(dict(map(lambda x: (x, 1), tmpwords)))
            return words.keys()
    except Exception, e:
        register_exception(prefix='ERROR: it\'s impossible to correctly '
            'extract words from %s' % url_direct_or_indirect,
            alert_admin=True)
        write_message("ERROR: %s" % e, stream=sys.stderr)
        return []

latex_markup_re = re.compile(r"\\begin(\[.+?\])?\{.+?\}|\\end\{.+?}|\\\w+(\[.+?\])?\{(?P<inside1>.*?)\}|\{\\\w+ (?P<inside2>.*?)\}")

def remove_latex_markup(phrase):
    ret_phrase = ''
    index = 0
    for match in latex_markup_re.finditer(phrase):
        ret_phrase += phrase[index:match.start()]
        ret_phrase += match.group('inside1') or match.group('inside2') or ''
        index = match.end()
    ret_phrase += phrase[index:]
    return ret_phrase

def get_nothing_from_phrase(phrase, stemming_language=None):
    """A dummy implementation of get_words_from_phrase to be used when a
    tag should not be indexed (such as when trying to extract phrases from
    8564_u)."""
    return []

def swap_temporary_reindex_tables(index_id, reindex_prefix="tmp_"):
    """Atomically swap the reindexed temporary table with the original one.
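    For index id 1 and the default 'tmp_' prefix this amounts to a single
    statement of the form (illustrative sketch):
        RENAME TABLE idxWORD01R TO old_idxWORD01R,
                     tmp_idxWORD01R TO idxWORD01R, ...;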
Delete the now-old one.""" write_message("Putting new tmp index tables for id %s into production" % index_id) run_sql( "RENAME TABLE " + "idxWORD%02dR TO old_idxWORD%02dR," % (index_id, index_id) + "%sidxWORD%02dR TO idxWORD%02dR," % (reindex_prefix, index_id, index_id) + "idxWORD%02dF TO old_idxWORD%02dF," % (index_id, index_id) + "%sidxWORD%02dF TO idxWORD%02dF," % (reindex_prefix, index_id, index_id) + "idxPAIR%02dR TO old_idxPAIR%02dR," % (index_id, index_id) + "%sidxPAIR%02dR TO idxPAIR%02dR," % (reindex_prefix, index_id, index_id) + "idxPAIR%02dF TO old_idxPAIR%02dF," % (index_id, index_id) + "%sidxPAIR%02dF TO idxPAIR%02dF," % (reindex_prefix, index_id, index_id) + "idxPHRASE%02dR TO old_idxPHRASE%02dR," % (index_id, index_id) + "%sidxPHRASE%02dR TO idxPHRASE%02dR," % (reindex_prefix, index_id, index_id) + "idxPHRASE%02dF TO old_idxPHRASE%02dF," % (index_id, index_id) + "%sidxPHRASE%02dF TO idxPHRASE%02dF;" % (reindex_prefix, index_id, index_id) ) write_message("Dropping old index tables for id %s" % index_id) run_sql("DROP TABLE old_idxWORD%02dR, old_idxWORD%02dF, old_idxPAIR%02dR, old_idxPAIR%02dF, old_idxPHRASE%02dR, old_idxPHRASE%02dF" % (index_id, index_id, index_id, index_id, index_id, index_id) ) def init_temporary_reindex_tables(index_id, reindex_prefix="tmp_"): """Create reindexing temporary tables.""" write_message("Creating new tmp index tables for id %s" % index_id) res = run_sql("""CREATE TABLE IF NOT EXISTS %sidxWORD%02dF ( id mediumint(9) unsigned NOT NULL auto_increment, term varchar(50) default NULL, hitlist longblob, PRIMARY KEY (id), UNIQUE KEY term (term) ) ENGINE=MyISAM""" % (reindex_prefix, index_id)) res = run_sql("""CREATE TABLE IF NOT EXISTS %sidxWORD%02dR ( id_bibrec mediumint(9) unsigned NOT NULL, termlist longblob, type enum('CURRENT','FUTURE','TEMPORARY') NOT NULL default 'CURRENT', PRIMARY KEY (id_bibrec,type) ) ENGINE=MyISAM""" % (reindex_prefix, index_id)) res = run_sql("""CREATE TABLE IF NOT EXISTS %sidxPAIR%02dF ( id mediumint(9) unsigned NOT NULL auto_increment, term varchar(100) default NULL, hitlist longblob, PRIMARY KEY (id), UNIQUE KEY term (term) ) ENGINE=MyISAM""" % (reindex_prefix, index_id)) res = run_sql("""CREATE TABLE IF NOT EXISTS %sidxPAIR%02dR ( id_bibrec mediumint(9) unsigned NOT NULL, termlist longblob, type enum('CURRENT','FUTURE','TEMPORARY') NOT NULL default 'CURRENT', PRIMARY KEY (id_bibrec,type) ) ENGINE=MyISAM""" % (reindex_prefix, index_id)) res = run_sql("""CREATE TABLE IF NOT EXISTS %sidxPHRASE%02dF ( id mediumint(9) unsigned NOT NULL auto_increment, term text default NULL, hitlist longblob, PRIMARY KEY (id), KEY term (term(50)) ) ENGINE=MyISAM""" % (reindex_prefix, index_id)) res = run_sql("""CREATE TABLE IF NOT EXISTS %sidxPHRASE%02dR ( id_bibrec mediumint(9) unsigned NOT NULL default '0', termlist longblob, type enum('CURRENT','FUTURE','TEMPORARY') NOT NULL default 'CURRENT', PRIMARY KEY (id_bibrec,type) ) ENGINE=MyISAM""" % (reindex_prefix, index_id)) run_sql("UPDATE idxINDEX SET last_updated='0000-00-00 00:00:00' WHERE id=%s", (index_id,)) latex_formula_re = re.compile(r'\$.*?\$|\\\[.*?\\\]') def get_words_from_phrase(phrase, stemming_language=None): """Return list of words found in PHRASE. Note that the phrase is split into groups depending on the alphanumeric characters and punctuation characters definition present in the config file. 
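
    For instance (illustrative only; the exact output depends on the
    configured punctuation/separator classes, stemming and stopword
    settings), a phrase like 'High-energy physics' would yield word-terms
    such as 'high-energy', 'high', 'energy' and 'physics'.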
""" words = {} formulas = [] if CFG_BIBINDEX_REMOVE_HTML_MARKUP and phrase.find(" -1: phrase = remove_html_markup(phrase) if CFG_BIBINDEX_REMOVE_LATEX_MARKUP: formulas = latex_formula_re.findall(phrase) phrase = remove_latex_markup(phrase) phrase = latex_formula_re.sub(' ', phrase) phrase = wash_for_utf8(phrase) phrase = lower_index_term(phrase) # 1st split phrase into blocks according to whitespace for block in strip_accents(phrase).split(): # 2nd remove leading/trailing punctuation and add block: block = re_block_punctuation_begin.sub("", block) block = re_block_punctuation_end.sub("", block) if block: stemmed_block = apply_stemming_and_stopwords_and_length_check(block, stemming_language) if stemmed_block: words[stemmed_block] = 1 if re_arxiv.match(block): # special case for blocks like `arXiv:1007.5048' where # we would like to index the part after the colon # regardless of dot or other punctuation characters: words[block.split(':', 1)[1]] = 1 # 3rd break each block into subblocks according to punctuation and add subblocks: for subblock in re_punctuation.split(block): stemmed_subblock = apply_stemming_and_stopwords_and_length_check(subblock, stemming_language) if stemmed_subblock: words[stemmed_subblock] = 1 # 4th break each subblock into alphanumeric groups and add groups: for alphanumeric_group in re_separators.split(subblock): stemmed_alphanumeric_group = apply_stemming_and_stopwords_and_length_check(alphanumeric_group, stemming_language) if stemmed_alphanumeric_group: words[stemmed_alphanumeric_group] = 1 for block in formulas: words[block] = 1 return words.keys() def get_pairs_from_phrase(phrase, stemming_language=None): """Return list of words found in PHRASE. Note that the phrase is split into groups depending on the alphanumeric characters and punctuation characters definition present in the config file. """ words = {} if CFG_BIBINDEX_REMOVE_HTML_MARKUP and phrase.find(" -1: phrase = remove_html_markup(phrase) if CFG_BIBINDEX_REMOVE_LATEX_MARKUP: phrase = remove_latex_markup(phrase) phrase = latex_formula_re.sub(' ', phrase) phrase = wash_for_utf8(phrase) phrase = lower_index_term(phrase) # 1st split phrase into blocks according to whitespace last_word = '' for block in strip_accents(phrase).split(): # 2nd remove leading/trailing punctuation and add block: block = re_block_punctuation_begin.sub("", block) block = re_block_punctuation_end.sub("", block) if block: if stemming_language: block = apply_stemming_and_stopwords_and_length_check(block, stemming_language) # 3rd break each block into subblocks according to punctuation and add subblocks: for subblock in re_punctuation.split(block): if stemming_language: subblock = apply_stemming_and_stopwords_and_length_check(subblock, stemming_language) if subblock: # 4th break each subblock into alphanumeric groups and add groups: for alphanumeric_group in re_separators.split(subblock): if stemming_language: alphanumeric_group = apply_stemming_and_stopwords_and_length_check(alphanumeric_group, stemming_language) if alphanumeric_group: if last_word: words['%s %s' % (last_word, alphanumeric_group)] = 1 last_word = alphanumeric_group return words.keys() phrase_delimiter_re = re.compile(r'[\.:;\?\!]') space_cleaner_re = re.compile(r'\s+') def get_phrases_from_phrase(phrase, stemming_language=None): """Return list of phrases found in PHRASE. Note that the phrase is split into groups depending on the alphanumeric characters and punctuation characters definition present in the config file. 
""" phrase = wash_for_utf8(phrase) return [phrase] ## Note that we don't break phrases, they are used for exact style ## of searching. words = {} phrase = strip_accents(phrase) # 1st split phrase into blocks according to whitespace for block1 in phrase_delimiter_re.split(strip_accents(phrase)): block1 = block1.strip() if block1 and stemming_language: new_words = [] for block2 in re_punctuation.split(block1): block2 = block2.strip() if block2: for block3 in block2.split(): block3 = block3.strip() if block3: # Note that we don't stem phrases, they # are used for exact style of searching. new_words.append(block3) block1 = ' '.join(new_words) if block1: words[block1] = 1 return words.keys() def get_fuzzy_authors_from_phrase(phrase, stemming_language=None): """ Return list of fuzzy phrase-tokens suitable for storing into author phrase index. """ author_tokenizer = BibIndexFuzzyNameTokenizer() return author_tokenizer.tokenize(phrase) def get_exact_authors_from_phrase(phrase, stemming_language=None): """ Return list of exact phrase-tokens suitable for storing into exact author phrase index. """ author_tokenizer = BibIndexExactNameTokenizer() return author_tokenizer.tokenize(phrase) def get_author_family_name_words_from_phrase(phrase, stemming_language=None): """ Return list of words from author family names, not his/her first names. The phrase is assumed to be the full author name. This is useful for CFG_BIBINDEX_AUTHOR_WORD_INDEX_EXCLUDE_FIRST_NAMES. """ d_family_names = {} # first, treat everything before first comma as surname: if ',' in phrase: d_family_names[phrase.split(',', 1)[0]] = 1 # second, try fuzzy author tokenizer to find surname variants: for name in get_fuzzy_authors_from_phrase(phrase, stemming_language): if ',' in name: d_family_names[name.split(',', 1)[0]] = 1 # now extract words from these surnames: d_family_names_words = {} for family_name in d_family_names.keys(): for word in get_words_from_phrase(family_name, stemming_language): d_family_names_words[word] = 1 return d_family_names_words.keys() def apply_stemming_and_stopwords_and_length_check(word, stemming_language): """Return WORD after applying stemming and stopword and length checks. See the config file in order to influence these. """ # now check against stopwords: if is_stopword(word): return "" # finally check the word length: if len(word) < CFG_BIBINDEX_MIN_WORD_LENGTH: return "" # stem word, when configured so: if stemming_language: word = stem(word, stemming_language) return word def remove_subfields(s): "Removes subfields from string, e.g. 'foo $$c bar' becomes 'foo bar'." return re_subfields.sub(' ', s) def get_index_id_from_index_name(index_name): """Returns the words/phrase index id for INDEXNAME. Returns empty string in case there is no words table for this index. Example: field='author', output=4.""" out = 0 query = """SELECT w.id FROM idxINDEX AS w WHERE w.name=%s LIMIT 1""" res = run_sql(query, (index_name, ), 1) if res: out = res[0][0] return out def get_index_name_from_index_id(index_id): """Returns the words/phrase index name for INDEXID. Returns '' in case there is no words table for this indexid. Example: field=9, output='fulltext'.""" res = run_sql("SELECT name FROM idxINDEX WHERE id=%s", (index_id, )) if res: return res[0][0] return '' def get_index_tags(indexname): """Returns the list of tags that are indexed inside INDEXNAME. Returns empty list in case there are no tags indexed in this index. Note: uses get_field_tags() defined before. 
Example: field='author', output=['100__%', '700__%'].""" out = [] query = """SELECT f.code FROM idxINDEX AS w, idxINDEX_field AS wf, field AS f WHERE w.name=%s AND w.id=wf.id_idxINDEX AND f.id=wf.id_field""" res = run_sql(query, (indexname, )) for row in res: out.extend(get_field_tags(row[0])) return out def get_all_indexes(): """Returns the list of the names of all defined words indexes. Returns empty list in case there are no tags indexed in this index. Example: output=['global', 'author'].""" out = [] query = """SELECT name FROM idxINDEX""" res = run_sql(query) for row in res: out.append(row[0]) return out def split_ranges(parse_string): """Parse a string a return the list or ranges.""" recIDs = [] ranges = parse_string.split(",") for arange in ranges: tmp_recIDs = arange.split("-") if len(tmp_recIDs)==1: recIDs.append([int(tmp_recIDs[0]), int(tmp_recIDs[0])]) else: if int(tmp_recIDs[0]) > int(tmp_recIDs[1]): # sanity check tmp = tmp_recIDs[0] tmp_recIDs[0] = tmp_recIDs[1] tmp_recIDs[1] = tmp recIDs.append([int(tmp_recIDs[0]), int(tmp_recIDs[1])]) return recIDs def get_word_tables(tables): """ Given a list of table names it return a list of tuples (index_id, index_name, index_tags). If tables is empty it returns the whole list.""" wordTables = [] if tables: indexes = tables.split(",") for index in indexes: index_id = get_index_id_from_index_name(index) if index_id: wordTables.append((index_id, index, get_index_tags(index))) else: write_message("Error: There is no %s words table." % index, sys.stderr) else: for index in get_all_indexes(): index_id = get_index_id_from_index_name(index) wordTables.append((index_id, index, get_index_tags(index))) return wordTables def get_date_range(var): "Returns the two dates contained as a low,high tuple" limits = var.split(",") if len(limits)==1: low = get_datetime(limits[0]) return low, None if len(limits)==2: low = get_datetime(limits[0]) high = get_datetime(limits[1]) return low, high return None, None def create_range_list(res): """Creates a range list from a recID select query result contained in res. The result is expected to have ascending numerical order.""" if not res: return [] row = res[0] if not row: return [] else: range_list = [[row, row]] for row in res[1:]: row_id = row if row_id == range_list[-1][1] + 1: range_list[-1][1] = row_id else: range_list.append([row_id, row_id]) return range_list def beautify_range_list(range_list): """Returns a non overlapping, maximal range list""" ret_list = [] for new in range_list: found = 0 for old in ret_list: if new[0] <= old[0] <= new[1] + 1 or new[0] - 1 <= old[1] <= new[1]: old[0] = min(old[0], new[0]) old[1] = max(old[1], new[1]) found = 1 break if not found: ret_list.append(new) return ret_list def truncate_index_table(index_name): """Properly truncate the given index.""" index_id = get_index_id_from_index_name(index_name) if index_id: write_message('Truncating %s index table in order to reindex.' % index_name, verbose=2) run_sql("UPDATE idxINDEX SET last_updated='0000-00-00 00:00:00' WHERE id=%s", (index_id,)) run_sql("TRUNCATE idxWORD%02dF" % index_id) run_sql("TRUNCATE idxWORD%02dR" % index_id) run_sql("TRUNCATE idxPHRASE%02dF" % index_id) run_sql("TRUNCATE idxPHRASE%02dR" % index_id) def update_index_last_updated(index_id, starting_time=None): """Update last_updated column of the index table in the database. 
Puts starting time there so that if the task was interrupted for record download, the records will be reindexed next time.""" if starting_time is None: return None write_message("updating last_updated to %s..." % starting_time, verbose=9) return run_sql("UPDATE idxINDEX SET last_updated=%s WHERE id=%s", (starting_time, index_id,)) #def update_text_extraction_date(first_recid, last_recid): #"""for all the bibdoc connected to the specified recid, set #the text_extraction_date to the task_starting_time.""" #run_sql("UPDATE bibdoc JOIN bibrec_bibdoc ON id=id_bibdoc SET text_extraction_date=%s WHERE id_bibrec BETWEEN %s AND %s", (task_get_task_param('task_starting_time'), first_recid, last_recid)) class WordTable: "A class to hold the words table." def __init__(self, index_id, fields_to_index, table_name_pattern, default_get_words_fnc, tag_to_words_fnc_map, wash_index_terms=50, is_fulltext_index=False): """Creates words table instance. @param index_id: the index integer identificator @param fields_to_index: a list of fields to index @param table_name_pattern: i.e. idxWORD%02dF or idxPHRASE%02dF @parm default_get_words_fnc: the default function called to extract words from a metadata @param tag_to_words_fnc_map: a mapping to specify particular function to extract words from particular metdata (such as 8564_u) @param wash_index_terms: do we wash index terms, and if yes (when >0), how many characters do we keep in the index terms; see max_char_length parameter of wash_index_term() """ self.index_id = index_id self.tablename = table_name_pattern % index_id self.recIDs_in_mem = [] self.fields_to_index = fields_to_index self.value = {} self.stemming_language = get_index_stemming_language(index_id) self.is_fulltext_index = is_fulltext_index self.wash_index_terms = wash_index_terms # tagToFunctions mapping. It offers an indirection level necessary for # indexing fulltext. The default is get_words_from_phrase self.tag_to_words_fnc_map = tag_to_words_fnc_map self.default_get_words_fnc = default_get_words_fnc if self.stemming_language and self.tablename.startswith('idxWORD'): write_message('%s has stemming enabled, language %s' % (self.tablename, self.stemming_language)) def get_field(self, recID, tag): """Returns list of values of the MARC-21 'tag' fields for the record 'recID'.""" out = [] bibXXx = "bib" + tag[0] + tag[1] + "x" bibrec_bibXXx = "bibrec_" + bibXXx query = """SELECT value FROM %s AS b, %s AS bb WHERE bb.id_bibrec=%%s AND bb.id_bibxxx=b.id AND tag LIKE %%s""" % (bibXXx, bibrec_bibXXx) res = run_sql(query, (recID, tag)) for row in res: out.append(row[0]) return out def clean(self): "Cleans the words table." self.value = {} def put_into_db(self, mode="normal"): """Updates the current words table in the corresponding DB idxFOO table. Mode 'normal' means normal execution, mode 'emergency' means words index reverting to old state. 
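
        Illustrative state flow of the reverse table rows (a sketch of the
        queries below), in 'normal' mode:

            CURRENT   -> TEMPORARY   (rows about to be replaced)
            FUTURE    -> CURRENT     (freshly flushed rows)
            TEMPORARY rows deleted

        In 'emergency' mode, TEMPORARY rows are restored to CURRENT and
        FUTURE rows are deleted instead.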
""" write_message("%s %s wordtable flush started" % (self.tablename, mode)) write_message('...updating %d words into %s started' % \ (len(self.value), self.tablename)) task_update_progress("%s flushed %d/%d words" % (self.tablename, 0, len(self.value))) self.recIDs_in_mem = beautify_range_list(self.recIDs_in_mem) if mode == "normal": for group in self.recIDs_in_mem: query = """UPDATE %sR SET type='TEMPORARY' WHERE id_bibrec BETWEEN %%s AND %%s AND type='CURRENT'""" % self.tablename[:-1] write_message(query % (group[0], group[1]), verbose=9) run_sql(query, (group[0], group[1])) nb_words_total = len(self.value) nb_words_report = int(nb_words_total/10.0) nb_words_done = 0 for word in self.value.keys(): self.put_word_into_db(word) nb_words_done += 1 if nb_words_report != 0 and ((nb_words_done % nb_words_report) == 0): write_message('......processed %d/%d words' % (nb_words_done, nb_words_total)) task_update_progress("%s flushed %d/%d words" % (self.tablename, nb_words_done, nb_words_total)) write_message('...updating %d words into %s ended' % \ (nb_words_total, self.tablename)) write_message('...updating reverse table %sR started' % self.tablename[:-1]) if mode == "normal": for group in self.recIDs_in_mem: query = """UPDATE %sR SET type='CURRENT' WHERE id_bibrec BETWEEN %%s AND %%s AND type='FUTURE'""" % self.tablename[:-1] write_message(query % (group[0], group[1]), verbose=9) run_sql(query, (group[0], group[1])) query = """DELETE FROM %sR WHERE id_bibrec BETWEEN %%s AND %%s AND type='TEMPORARY'""" % self.tablename[:-1] write_message(query % (group[0], group[1]), verbose=9) run_sql(query, (group[0], group[1])) #if self.is_fulltext_index: #update_text_extraction_date(group[0], group[1]) write_message('End of updating wordTable into %s' % self.tablename, verbose=9) elif mode == "emergency": for group in self.recIDs_in_mem: query = """UPDATE %sR SET type='CURRENT' WHERE id_bibrec BETWEEN %%s AND %%s AND type='TEMPORARY'""" % self.tablename[:-1] write_message(query % (group[0], group[1]), verbose=9) run_sql(query, (group[0], group[1])) query = """DELETE FROM %sR WHERE id_bibrec BETWEEN %%s AND %%s AND type='FUTURE'""" % self.tablename[:-1] write_message(query % (group[0], group[1]), verbose=9) run_sql(query, (group[0], group[1])) write_message('End of emergency flushing wordTable into %s' % self.tablename, verbose=9) write_message('...updating reverse table %sR ended' % self.tablename[:-1]) self.clean() self.recIDs_in_mem = [] write_message("%s %s wordtable flush ended" % (self.tablename, mode)) task_update_progress("%s flush ended" % (self.tablename)) def load_old_recIDs(self, word): """Load existing hitlist for the word from the database index files.""" query = "SELECT hitlist FROM %s WHERE term=%%s" % self.tablename res = run_sql(query, (word,)) if res: return intbitset(res[0][0]) else: return None def merge_with_old_recIDs(self, word, set): """Merge the system numbers stored in memory (hash of recIDs with value +1 or -1 according to whether to add/delete them) with those stored in the database index and received in set universe of recIDs for the given word. Return False in case no change was done to SET, return True in case SET was changed. 
""" oldset = intbitset(set) set.update_with_signs(self.value[word]) return set != oldset def put_word_into_db(self, word): """Flush a single word to the database and delete it from memory""" set = self.load_old_recIDs(word) if set is not None: # merge the word recIDs found in memory: if not self.merge_with_old_recIDs(word,set): # nothing to update: write_message("......... unchanged hitlist for ``%s''" % word, verbose=9) pass else: # yes there were some new words: write_message("......... updating hitlist for ``%s''" % word, verbose=9) run_sql("UPDATE %s SET hitlist=%%s WHERE term=%%s" % self.tablename, (set.fastdump(), word)) else: # the word is new, will create new set: write_message("......... inserting hitlist for ``%s''" % word, verbose=9) set = intbitset(self.value[word].keys()) try: run_sql("INSERT INTO %s (term, hitlist) VALUES (%%s, %%s)" % self.tablename, (word, set.fastdump())) except Exception, e: ## We send this exception to the admin only when is not ## already reparing the problem. register_exception(prefix="Error when putting the term '%s' into db (hitlist=%s): %s\n" % (repr(word), set, e), alert_admin=(task_get_option('cmd') != 'repair')) if not set: # never store empty words run_sql("DELETE from %s WHERE term=%%s" % self.tablename, (word,)) del self.value[word] def display(self): "Displays the word table." keys = self.value.keys() keys.sort() for k in keys: write_message("%s: %s" % (k, self.value[k])) def count(self): "Returns the number of words in the table." return len(self.value) def info(self): "Prints some information on the words table." write_message("The words table contains %d words." % self.count()) def lookup_words(self, word=""): "Lookup word from the words table." if not word: done = 0 while not done: try: word = raw_input("Enter word: ") done = 1 except (EOFError, KeyboardInterrupt): return if self.value.has_key(word): write_message("The word '%s' is found %d times." \ % (word, len(self.value[word]))) else: write_message("The word '%s' does not exist in the word file."\ % word) def add_recIDs(self, recIDs, opt_flush): """Fetches records which id in the recIDs range list and adds them to the wordTable. The recIDs range list is of the form: [[i1_low,i1_high],[i2_low,i2_high], ..., [iN_low,iN_high]]. 
""" global chunksize, _last_word_table flush_count = 0 records_done = 0 records_to_go = 0 for arange in recIDs: records_to_go = records_to_go + arange[1] - arange[0] + 1 time_started = time.time() # will measure profile time for arange in recIDs: i_low = arange[0] chunksize_count = 0 while i_low <= arange[1]: # calculate chunk group of recIDs and treat it: i_high = min(i_low+opt_flush-flush_count-1,arange[1]) i_high = min(i_low+chunksize-chunksize_count-1, i_high) try: self.chk_recID_range(i_low, i_high) except StandardError, e: write_message("Exception caught: %s" % e, sys.stderr) register_exception(alert_admin=True) task_update_status("ERROR") self.put_into_db() sys.exit(1) write_message("%s adding records #%d-#%d started" % \ (self.tablename, i_low, i_high)) if CFG_CHECK_MYSQL_THREADS: kill_sleepy_mysql_threads() task_update_progress("%s adding recs %d-%d" % (self.tablename, i_low, i_high)) self.del_recID_range(i_low, i_high) just_processed = self.add_recID_range(i_low, i_high) flush_count = flush_count + i_high - i_low + 1 chunksize_count = chunksize_count + i_high - i_low + 1 records_done = records_done + just_processed write_message("%s adding records #%d-#%d ended " % \ (self.tablename, i_low, i_high)) if chunksize_count >= chunksize: chunksize_count = 0 # flush if necessary: if flush_count >= opt_flush: self.put_into_db() self.clean() write_message("%s backing up" % (self.tablename)) flush_count = 0 self.log_progress(time_started,records_done,records_to_go) # iterate: i_low = i_high + 1 if flush_count > 0: self.put_into_db() self.log_progress(time_started,records_done,records_to_go) def add_recIDs_by_date(self, dates, opt_flush): """Add records that were modified between DATES[0] and DATES[1]. If DATES is not set, then add records that were modified since the last update of the index. """ if not dates: table_id = self.tablename[-3:-1] query = """SELECT last_updated FROM idxINDEX WHERE id=%s""" res = run_sql(query, (table_id, )) if not res: return if not res[0][0]: dates = ("0000-00-00", None) else: dates = (res[0][0], None) if dates[1] is None: res = intbitset(run_sql("""SELECT b.id FROM bibrec AS b WHERE b.modification_date >= %s""", (dates[0],))) if self.is_fulltext_index: res |= intbitset(run_sql("""SELECT id_bibrec FROM bibrec_bibdoc JOIN bibdoc ON id_bibdoc=id WHERE text_extraction_date <= modification_date AND modification_date >= %s AND status<>'DELETED'""", (dates[0], ))) elif dates[0] is None: res = intbitset(run_sql("""SELECT b.id FROM bibrec AS b WHERE b.modification_date <= %s""", (dates[1],))) if self.is_fulltext_index: res |= intbitset(run_sql("""SELECT id_bibrec FROM bibrec_bibdoc JOIN bibdoc ON id_bibdoc=id WHERE text_extraction_date <= modification_date AND modification_date <= %s AND status<>'DELETED'""", (dates[1], ))) else: res = intbitset(run_sql("""SELECT b.id FROM bibrec AS b WHERE b.modification_date >= %s AND b.modification_date <= %s""", (dates[0], dates[1]))) if self.is_fulltext_index: res |= intbitset(run_sql("""SELECT id_bibrec FROM bibrec_bibdoc JOIN bibdoc ON id_bibdoc=id WHERE text_extraction_date <= modification_date AND modification_date >= %s AND modification_date <= %s AND status<>'DELETED'""", (dates[0], dates[1], ))) alist = create_range_list(list(res)) if not alist: write_message( "No new records added. 
%s is up to date" % self.tablename) else: self.add_recIDs(alist, opt_flush) def add_recID_range(self, recID1, recID2): """Add records from RECID1 to RECID2.""" wlist = {} self.recIDs_in_mem.append([recID1,recID2]) # secondly fetch all needed tags: if self.fields_to_index == [CFG_JOURNAL_TAG]: # FIXME: quick hack for the journal index; a special # treatment where we need to associate more than one # subfield into indexed term for recID in range(recID1, recID2 + 1): new_words = get_words_from_journal_tag(recID, self.fields_to_index[0]) if not wlist.has_key(recID): wlist[recID] = [] wlist[recID] = list_union(new_words, wlist[recID]) else: # usual tag-by-tag indexing: for tag in self.fields_to_index: get_words_function = self.tag_to_words_fnc_map.get(tag, self.default_get_words_fnc) bibXXx = "bib" + tag[0] + tag[1] + "x" bibrec_bibXXx = "bibrec_" + bibXXx query = """SELECT bb.id_bibrec,b.value FROM %s AS b, %s AS bb WHERE bb.id_bibrec BETWEEN %%s AND %%s AND bb.id_bibxxx=b.id AND tag LIKE %%s""" % (bibXXx, bibrec_bibXXx) res = run_sql(query, (recID1, recID2, tag)) if tag == '8564_u': ## FIXME: Quick hack to be sure that hidden files are ## actually indexed. res = set(res) for recid in xrange(int(recID1), int(recID2) + 1): for bibdocfile in BibRecDocs(recid).list_latest_files(): res.add((recid, bibdocfile.get_url())) for row in res: recID,phrase = row if not wlist.has_key(recID): wlist[recID] = [] new_words = get_words_function(phrase, stemming_language=self.stemming_language) # ,self.separators wlist[recID] = list_union(new_words, wlist[recID]) # were there some words for these recIDs found? if len(wlist) == 0: return 0 recIDs = wlist.keys() for recID in recIDs: # was this record marked as deleted? if "DELETED" in self.get_field(recID, "980__c"): wlist[recID] = [] write_message("... record %d was declared deleted, removing its word list" % recID, verbose=9) write_message("... record %d, termlist: %s" % (recID, wlist[recID]), verbose=9) # put words into reverse index table with FUTURE status: for recID in recIDs: run_sql("INSERT INTO %sR (id_bibrec,termlist,type) VALUES (%%s,%%s,'FUTURE')" % self.tablename[:-1], (recID, serialize_via_marshal(wlist[recID]))) # ... and, for new records, enter the CURRENT status as empty: try: run_sql("INSERT INTO %sR (id_bibrec,termlist,type) VALUES (%%s,%%s,'CURRENT')" % self.tablename[:-1], (recID, serialize_via_marshal([]))) except DatabaseError: # okay, it's an already existing record, no problem pass # put words into memory word list: put = self.put for recID in recIDs: for w in wlist[recID]: put(recID, w, 1) return len(recIDs) def log_progress(self, start, done, todo): """Calculate progress and store it. start: start time, done: records processed, todo: total number of records""" time_elapsed = time.time() - start # consistency check if time_elapsed == 0 or done > todo: return time_recs_per_min = done/(time_elapsed/60.0) write_message("%d records took %.1f seconds to complete.(%1.f recs/min)"\ % (done, time_elapsed, time_recs_per_min)) if time_recs_per_min: write_message("Estimated runtime: %.1f minutes" % \ ((todo-done)/time_recs_per_min)) def put(self, recID, word, sign): """Adds/deletes a word to the word list.""" try: if self.wash_index_terms: word = wash_index_term(word, self.wash_index_terms) if self.value.has_key(word): # the word 'word' exist already: update sign self.value[word][recID] = sign else: self.value[word] = {recID: sign} except: write_message("Error: Cannot put word %s with sign %d for recID %s." 
% (word, sign, recID)) def del_recIDs(self, recIDs): """Fetches records which id in the recIDs range list and adds them to the wordTable. The recIDs range list is of the form: [[i1_low,i1_high],[i2_low,i2_high], ..., [iN_low,iN_high]]. """ count = 0 for arange in recIDs: self.del_recID_range(arange[0],arange[1]) count = count + arange[1] - arange[0] self.put_into_db() def del_recID_range(self, low, high): """Deletes records with 'recID' system number between low and high from memory words index table.""" write_message("%s fetching existing words for records #%d-#%d started" % \ (self.tablename, low, high), verbose=3) self.recIDs_in_mem.append([low,high]) query = """SELECT id_bibrec,termlist FROM %sR as bb WHERE bb.id_bibrec BETWEEN %%s AND %%s""" % (self.tablename[:-1]) recID_rows = run_sql(query, (low, high)) for recID_row in recID_rows: recID = recID_row[0] wlist = deserialize_via_marshal(recID_row[1]) for word in wlist: self.put(recID, word, -1) write_message("%s fetching existing words for records #%d-#%d ended" % \ (self.tablename, low, high), verbose=3) def report_on_table_consistency(self): """Check reverse words index tables (e.g. idxWORD01R) for interesting states such as 'TEMPORARY' state. Prints small report (no of words, no of bad words). """ # find number of words: query = """SELECT COUNT(*) FROM %s""" % (self.tablename) res = run_sql(query, None, 1) if res: nb_words = res[0][0] else: nb_words = 0 # find number of records: query = """SELECT COUNT(DISTINCT(id_bibrec)) FROM %sR""" % (self.tablename[:-1]) res = run_sql(query, None, 1) if res: nb_records = res[0][0] else: nb_records = 0 # report stats: write_message("%s contains %d words from %d records" % (self.tablename, nb_words, nb_records)) # find possible bad states in reverse tables: query = """SELECT COUNT(DISTINCT(id_bibrec)) FROM %sR WHERE type <> 'CURRENT'""" % (self.tablename[:-1]) res = run_sql(query) if res: nb_bad_records = res[0][0] else: nb_bad_records = 999999999 if nb_bad_records: write_message("EMERGENCY: %s needs to repair %d of %d index records" % \ (self.tablename, nb_bad_records, nb_records)) else: write_message("%s is in consistent state" % (self.tablename)) return nb_bad_records def repair(self, opt_flush): """Repair the whole table""" # find possible bad states in reverse tables: query = """SELECT COUNT(DISTINCT(id_bibrec)) FROM %sR WHERE type <> 'CURRENT'""" % (self.tablename[:-1]) res = run_sql(query, None, 1) if res: nb_bad_records = res[0][0] else: nb_bad_records = 0 if nb_bad_records == 0: return query = """SELECT id_bibrec FROM %sR WHERE type <> 'CURRENT'""" \ % (self.tablename[:-1]) res = intbitset(run_sql(query)) recIDs = create_range_list(list(res)) flush_count = 0 records_done = 0 records_to_go = 0 for arange in recIDs: records_to_go = records_to_go + arange[1] - arange[0] + 1 time_started = time.time() # will measure profile time for arange in recIDs: i_low = arange[0] chunksize_count = 0 while i_low <= arange[1]: # calculate chunk group of recIDs and treat it: i_high = min(i_low+opt_flush-flush_count-1,arange[1]) i_high = min(i_low+chunksize-chunksize_count-1, i_high) try: self.fix_recID_range(i_low, i_high) except StandardError, e: write_message("Exception caught: %s" % e, sys.stderr) register_exception(alert_admin=True) task_update_status("ERROR") self.put_into_db() sys.exit(1) flush_count = flush_count + i_high - i_low + 1 chunksize_count = chunksize_count + i_high - i_low + 1 records_done = records_done + i_high - i_low + 1 if chunksize_count >= chunksize: chunksize_count = 0 # flush if 
necessary: if flush_count >= opt_flush: self.put_into_db("emergency") self.clean() flush_count = 0 self.log_progress(time_started,records_done,records_to_go) # iterate: i_low = i_high + 1 if flush_count > 0: self.put_into_db("emergency") self.log_progress(time_started,records_done,records_to_go) write_message("%s inconsistencies repaired." % self.tablename) def chk_recID_range(self, low, high): """Check if the reverse index table is in proper state""" ## check db query = """SELECT COUNT(*) FROM %sR WHERE type <> 'CURRENT' AND id_bibrec BETWEEN %%s AND %%s""" % self.tablename[:-1] res = run_sql(query, (low, high), 1) if res[0][0]==0: write_message("%s for %d-%d is in consistent state" % (self.tablename,low,high)) return # okay, words table is consistent ## inconsistency detected! write_message("EMERGENCY: %s inconsistencies detected..." % self.tablename) error_message = "Errors found. You should check consistency of the " \ "%s - %sR tables.\nRunning 'bibindex --repair' is " \ "recommended." % (self.tablename, self.tablename[:-1]) write_message("EMERGENCY: " + error_message, stream=sys.stderr) raise StandardError, error_message def fix_recID_range(self, low, high): """Try to fix reverse index database consistency (e.g. table idxWORD01R) in the low,high doc-id range. Possible states for a recID follow: CUR TMP FUT: very bad things have happened: warn! CUR TMP : very bad things have happened: warn! CUR FUT: delete FUT (crash before flushing) CUR : database is ok TMP FUT: add TMP to memory and del FUT from memory flush (revert to old state) TMP : very bad things have happened: warn! FUT: very bad things have happended: warn! """ state = {} query = "SELECT id_bibrec,type FROM %sR WHERE id_bibrec BETWEEN %%s AND %%s"\ % self.tablename[:-1] res = run_sql(query, (low, high)) for row in res: if not state.has_key(row[0]): state[row[0]]=[] state[row[0]].append(row[1]) ok = 1 # will hold info on whether we will be able to repair for recID in state.keys(): if not 'TEMPORARY' in state[recID]: if 'FUTURE' in state[recID]: if 'CURRENT' not in state[recID]: write_message("EMERGENCY: Index record %d is in inconsistent state. Can't repair it." % recID) ok = 0 else: write_message("EMERGENCY: Inconsistency in index record %d detected" % recID) query = """DELETE FROM %sR WHERE id_bibrec=%%s""" % self.tablename[:-1] run_sql(query, (recID, )) write_message("EMERGENCY: Inconsistency in record %d repaired." % recID) else: if 'FUTURE' in state[recID] and not 'CURRENT' in state[recID]: self.recIDs_in_mem.append([recID,recID]) # Get the words file query = """SELECT type,termlist FROM %sR WHERE id_bibrec=%%s""" % self.tablename[:-1] write_message(query, verbose=9) res = run_sql(query, (recID, )) for row in res: wlist = deserialize_via_marshal(row[1]) write_message("Words are %s " % wlist, verbose=9) if row[0] == 'TEMPORARY': sign = 1 else: sign = -1 for word in wlist: self.put(recID, word, sign) else: write_message("EMERGENCY: %s for %d is in inconsistent " "state. Couldn't repair it." % (self.tablename, recID), stream=sys.stderr) ok = 0 if not ok: error_message = "Unrepairable errors found. You should check " \ "consistency of the %s - %sR tables. Deleting affected " \ "TEMPORARY and FUTURE entries from these tables is " \ "recommended; see the BibIndex Admin Guide." 
% \ (self.tablename, self.tablename[:-1]) write_message("EMERGENCY: " + error_message, stream=sys.stderr) raise StandardError, error_message def main(): """Main that construct all the bibtask.""" task_init(authorization_action='runbibindex', authorization_msg="BibIndex Task Submission", description="""Examples: \t%s -a -i 234-250,293,300-500 -u admin@localhost \t%s -a -w author,fulltext -M 8192 -v3 \t%s -d -m +4d -A on --flush=10000\n""" % ((sys.argv[0],) * 3), help_specific_usage=""" Indexing options: -a, --add\t\tadd or update words for selected records -d, --del\t\tdelete words for selected records -i, --id=low[-high]\t\tselect according to doc recID -m, --modified=from[,to]\tselect according to modification date -c, --collection=c1[,c2]\tselect according to collection -R, --reindex\treindex the selected indexes from scratch Repairing options: -k, --check\t\tcheck consistency for all records in the table(s) -r, --repair\t\ttry to repair all records in the table(s) Specific options: -w, --windex=w1[,w2]\tword/phrase indexes to consider (all) -M, --maxmem=XXX\tmaximum memory usage in kB (no limit) -f, --flush=NNN\t\tfull consistent table flush after NNN records (10000) """, version=__revision__, specific_params=("adi:m:c:w:krRM:f:", [ "add", "del", "id=", "modified=", "collection=", "windex=", "check", "repair", "reindex", "maxmem=", "flush=", ]), task_stop_helper_fnc=task_stop_table_close_fnc, task_submit_elaborate_specific_parameter_fnc=task_submit_elaborate_specific_parameter, task_run_fnc=task_run_core, task_submit_check_options_fnc=task_submit_check_options) def task_submit_check_options(): """Check for options compatibility.""" if task_get_option("reindex"): if task_get_option("cmd") != "add" or task_get_option('id') or task_get_option('collection'): print >> sys.stderr, "ERROR: You can use --reindex only when adding modified record." return False return True def task_submit_elaborate_specific_parameter(key, value, opts, args): """ Given the string key it checks it's meaning, eventually using the value. Usually it fills some key in the options dict. It must return True if it has elaborated the key, False, if it doesn't know that key. eg: if key in ['-n', '--number']: self.options['number'] = value return True return False """ if key in ("-a", "--add"): task_set_option("cmd", "add") if ("-x","") in opts or ("--del","") in opts: raise StandardError, "Can not have --add and --del at the same time!" elif key in ("-k", "--check"): task_set_option("cmd", "check") elif key in ("-r", "--repair"): task_set_option("cmd", "repair") elif key in ("-d", "--del"): task_set_option("cmd", "del") elif key in ("-i", "--id"): task_set_option('id', task_get_option('id') + split_ranges(value)) elif key in ("-m", "--modified"): task_set_option("modified", get_date_range(value)) elif key in ("-c", "--collection"): task_set_option("collection", value) elif key in ("-R", "--reindex"): task_set_option("reindex", True) elif key in ("-w", "--windex"): task_set_option("windex", value) elif key in ("-M", "--maxmem"): task_set_option("maxmem", int(value)) if task_get_option("maxmem") < base_process_size + 1000: raise StandardError, "Memory usage should be higher than %d kB" % \ (base_process_size + 1000) elif key in ("-f", "--flush"): task_set_option("flush", int(value)) else: return False return True def task_stop_table_close_fnc(): """ Close tables to STOP. 
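    If a word table is still held in memory (_last_word_table), it is flushed
    to the database first, so that no partially indexed words are lost when
    the task stops.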
""" global _last_word_table if _last_word_table: _last_word_table.put_into_db() def task_run_core(): """Runs the task by fetching arguments from the BibSched task queue. This is what BibSched will be invoking via daemon call. The task prints Fibonacci numbers for up to NUM on the stdout, and some messages on stderr. Return 1 in case of success and 0 in case of failure.""" global _last_word_table if task_get_option("cmd") == "check": wordTables = get_word_tables(task_get_option("windex")) for index_id, index_name, index_tags in wordTables: if index_name == 'year' and CFG_INSPIRE_SITE: fnc_get_words_from_phrase = get_words_from_date_tag elif index_name in ('author', 'firstauthor') and \ CFG_BIBINDEX_AUTHOR_WORD_INDEX_EXCLUDE_FIRST_NAMES: fnc_get_words_from_phrase = get_author_family_name_words_from_phrase else: fnc_get_words_from_phrase = get_words_from_phrase wordTable = WordTable(index_id=index_id, fields_to_index=index_tags, table_name_pattern='idxWORD%02dF', default_get_words_fnc=fnc_get_words_from_phrase, tag_to_words_fnc_map={'8564_u': get_words_from_fulltext}, wash_index_terms=50) _last_word_table = wordTable wordTable.report_on_table_consistency() task_sleep_now_if_required(can_stop_too=True) if index_name in ('author', 'firstauthor') and \ CFG_BIBINDEX_AUTHOR_WORD_INDEX_EXCLUDE_FIRST_NAMES: fnc_get_pairs_from_phrase = get_pairs_from_phrase # FIXME else: fnc_get_pairs_from_phrase = get_pairs_from_phrase wordTable = WordTable(index_id=index_id, fields_to_index=index_tags, table_name_pattern='idxPAIR%02dF', default_get_words_fnc=fnc_get_pairs_from_phrase, tag_to_words_fnc_map={'8564_u': get_nothing_from_phrase}, wash_index_terms=100) _last_word_table = wordTable wordTable.report_on_table_consistency() task_sleep_now_if_required(can_stop_too=True) if index_name in ('author', 'firstauthor'): fnc_get_phrases_from_phrase = get_fuzzy_authors_from_phrase elif index_name == 'exactauthor': fnc_get_phrases_from_phrase = get_exact_authors_from_phrase else: fnc_get_phrases_from_phrase = get_phrases_from_phrase wordTable = WordTable(index_id=index_id, fields_to_index=index_tags, table_name_pattern='idxPHRASE%02dF', default_get_words_fnc=fnc_get_phrases_from_phrase, tag_to_words_fnc_map={'8564_u': get_nothing_from_phrase}, wash_index_terms=0) _last_word_table = wordTable wordTable.report_on_table_consistency() task_sleep_now_if_required(can_stop_too=True) _last_word_table = None return True # Let's work on single words! 
wordTables = get_word_tables(task_get_option("windex")) for index_id, index_name, index_tags in wordTables: is_fulltext_index = index_name == 'fulltext' reindex_prefix = "" if task_get_option("reindex"): reindex_prefix = "tmp_" init_temporary_reindex_tables(index_id, reindex_prefix) if index_name == 'year' and CFG_INSPIRE_SITE: fnc_get_words_from_phrase = get_words_from_date_tag elif index_name in ('author', 'firstauthor') and \ CFG_BIBINDEX_AUTHOR_WORD_INDEX_EXCLUDE_FIRST_NAMES: fnc_get_words_from_phrase = get_author_family_name_words_from_phrase else: fnc_get_words_from_phrase = get_words_from_phrase wordTable = WordTable(index_id=index_id, fields_to_index=index_tags, table_name_pattern=reindex_prefix + 'idxWORD%02dF', default_get_words_fnc=fnc_get_words_from_phrase, tag_to_words_fnc_map={'8564_u': get_words_from_fulltext}, is_fulltext_index=is_fulltext_index, wash_index_terms=50) _last_word_table = wordTable wordTable.report_on_table_consistency() try: if task_get_option("cmd") == "del": if task_get_option("id"): wordTable.del_recIDs(task_get_option("id")) task_sleep_now_if_required(can_stop_too=True) elif task_get_option("collection"): l_of_colls = task_get_option("collection").split(",") recIDs = perform_request_search(c=l_of_colls) recIDs_range = [] for recID in recIDs: recIDs_range.append([recID,recID]) wordTable.del_recIDs(recIDs_range) task_sleep_now_if_required(can_stop_too=True) else: error_message = "Missing IDs of records to delete from " \ "index %s." % wordTable.tablename write_message(error_message, stream=sys.stderr) raise StandardError, error_message elif task_get_option("cmd") == "add": if task_get_option("id"): wordTable.add_recIDs(task_get_option("id"), task_get_option("flush")) task_sleep_now_if_required(can_stop_too=True) elif task_get_option("collection"): l_of_colls = task_get_option("collection").split(",") recIDs = perform_request_search(c=l_of_colls) recIDs_range = [] for recID in recIDs: recIDs_range.append([recID,recID]) wordTable.add_recIDs(recIDs_range, task_get_option("flush")) task_sleep_now_if_required(can_stop_too=True) else: wordTable.add_recIDs_by_date(task_get_option("modified"), task_get_option("flush")) ## here we used to update last_updated info, if run via automatic mode; ## but do not update here anymore, since idxPHRASE will be acted upon later task_sleep_now_if_required(can_stop_too=True) elif task_get_option("cmd") == "repair": wordTable.repair(task_get_option("flush")) task_sleep_now_if_required(can_stop_too=True) else: error_message = "Invalid command found processing %s" % \ wordTable.tablename write_message(error_message, stream=sys.stderr) raise StandardError, error_message except StandardError, e: write_message("Exception caught: %s" % e, sys.stderr) register_exception(alert_admin=True) task_update_status("ERROR") if _last_word_table: _last_word_table.put_into_db() sys.exit(1) wordTable.report_on_table_consistency() task_sleep_now_if_required(can_stop_too=True) # Let's work on pairs now if index_name in ('author', 'firstauthor') and \ CFG_BIBINDEX_AUTHOR_WORD_INDEX_EXCLUDE_FIRST_NAMES: fnc_get_pairs_from_phrase = get_pairs_from_phrase # FIXME else: fnc_get_pairs_from_phrase = get_pairs_from_phrase wordTable = WordTable(index_id=index_id, fields_to_index=index_tags, table_name_pattern=reindex_prefix + 'idxPAIR%02dF', default_get_words_fnc=fnc_get_pairs_from_phrase, tag_to_words_fnc_map={'8564_u': get_nothing_from_phrase}, wash_index_terms=100) _last_word_table = wordTable wordTable.report_on_table_consistency() try: if 
task_get_option("cmd") == "del": if task_get_option("id"): wordTable.del_recIDs(task_get_option("id")) task_sleep_now_if_required(can_stop_too=True) elif task_get_option("collection"): l_of_colls = task_get_option("collection").split(",") recIDs = perform_request_search(c=l_of_colls) recIDs_range = [] for recID in recIDs: recIDs_range.append([recID,recID]) wordTable.del_recIDs(recIDs_range) task_sleep_now_if_required(can_stop_too=True) else: error_message = "Missing IDs of records to delete from " \ "index %s." % wordTable.tablename write_message(error_message, stream=sys.stderr) raise StandardError, error_message elif task_get_option("cmd") == "add": if task_get_option("id"): wordTable.add_recIDs(task_get_option("id"), task_get_option("flush")) task_sleep_now_if_required(can_stop_too=True) elif task_get_option("collection"): l_of_colls = task_get_option("collection").split(",") recIDs = perform_request_search(c=l_of_colls) recIDs_range = [] for recID in recIDs: recIDs_range.append([recID,recID]) wordTable.add_recIDs(recIDs_range, task_get_option("flush")) task_sleep_now_if_required(can_stop_too=True) else: wordTable.add_recIDs_by_date(task_get_option("modified"), task_get_option("flush")) # let us update last_updated timestamp info, if run via automatic mode: task_sleep_now_if_required(can_stop_too=True) elif task_get_option("cmd") == "repair": wordTable.repair(task_get_option("flush")) task_sleep_now_if_required(can_stop_too=True) else: error_message = "Invalid command found processing %s" % \ wordTable.tablename write_message(error_message, stream=sys.stderr) raise StandardError, error_message except StandardError, e: write_message("Exception caught: %s" % e, sys.stderr) register_exception() task_update_status("ERROR") if _last_word_table: _last_word_table.put_into_db() sys.exit(1) wordTable.report_on_table_consistency() task_sleep_now_if_required(can_stop_too=True) # Let's work on phrases now if index_name in ('author', 'firstauthor'): fnc_get_phrases_from_phrase = get_fuzzy_authors_from_phrase elif index_name == 'exactauthor': fnc_get_phrases_from_phrase = get_exact_authors_from_phrase else: fnc_get_phrases_from_phrase = get_phrases_from_phrase wordTable = WordTable(index_id=index_id, fields_to_index=index_tags, table_name_pattern=reindex_prefix + 'idxPHRASE%02dF', default_get_words_fnc=fnc_get_phrases_from_phrase, tag_to_words_fnc_map={'8564_u': get_nothing_from_phrase}, wash_index_terms=0) _last_word_table = wordTable wordTable.report_on_table_consistency() try: if task_get_option("cmd") == "del": if task_get_option("id"): wordTable.del_recIDs(task_get_option("id")) task_sleep_now_if_required(can_stop_too=True) elif task_get_option("collection"): l_of_colls = task_get_option("collection").split(",") recIDs = perform_request_search(c=l_of_colls) recIDs_range = [] for recID in recIDs: recIDs_range.append([recID,recID]) wordTable.del_recIDs(recIDs_range) task_sleep_now_if_required(can_stop_too=True) else: error_message = "Missing IDs of records to delete from " \ "index %s." 
% wordTable.tablename write_message(error_message, stream=sys.stderr) raise StandardError, error_message elif task_get_option("cmd") == "add": if task_get_option("id"): wordTable.add_recIDs(task_get_option("id"), task_get_option("flush")) task_sleep_now_if_required(can_stop_too=True) elif task_get_option("collection"): l_of_colls = task_get_option("collection").split(",") recIDs = perform_request_search(c=l_of_colls) recIDs_range = [] for recID in recIDs: recIDs_range.append([recID,recID]) wordTable.add_recIDs(recIDs_range, task_get_option("flush")) task_sleep_now_if_required(can_stop_too=True) else: wordTable.add_recIDs_by_date(task_get_option("modified"), task_get_option("flush")) # let us update last_updated timestamp info, if run via automatic mode: update_index_last_updated(index_id, task_get_task_param('task_starting_time')) task_sleep_now_if_required(can_stop_too=True) elif task_get_option("cmd") == "repair": wordTable.repair(task_get_option("flush")) task_sleep_now_if_required(can_stop_too=True) else: error_message = "Invalid command found processing %s" % \ wordTable.tablename write_message(error_message, stream=sys.stderr) raise StandardError, error_message except StandardError, e: write_message("Exception caught: %s" % e, sys.stderr) register_exception() task_update_status("ERROR") if _last_word_table: _last_word_table.put_into_db() sys.exit(1) wordTable.report_on_table_consistency() task_sleep_now_if_required(can_stop_too=True) if task_get_option("reindex"): swap_temporary_reindex_tables(index_id, reindex_prefix) update_index_last_updated(index_id, task_get_task_param('task_starting_time')) task_sleep_now_if_required(can_stop_too=True) _last_word_table = None return True -## import optional modules: -try: - import psyco - psyco.bind(get_words_from_phrase) - psyco.bind(WordTable.merge_with_old_recIDs) -except: - pass - - ### okay, here we go: if __name__ == '__main__': main() diff --git a/modules/bibrank/lib/bibrank_record_sorter.py b/modules/bibrank/lib/bibrank_record_sorter.py index a79f29c26..baa311112 100644 --- a/modules/bibrank/lib/bibrank_record_sorter.py +++ b/modules/bibrank/lib/bibrank_record_sorter.py @@ -1,688 +1,677 @@ # -*- coding: utf-8 -*- ## Ranking of records using different parameters and methods on the fly. ## ## This file is part of Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010 CERN. ## ## Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. 
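## A minimal usage sketch (hypothetical values, for orientation only;
## "wrd" is assumed to be a configured word-similarity rank method):
##
##   from invenio.bibrank_record_sorter import rank_records
##   (recids, scores, prefix, postfix, vout) = rank_records(
##       "wrd", 0, hitset, ["ellis"], verbose=0)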
__revision__ = "$Id$" import string import time import math import re import ConfigParser import copy from invenio.config import \ CFG_SITE_LANG, \ CFG_ETCDIR from invenio.dbquery import run_sql, deserialize_via_marshal from invenio.errorlib import register_exception from invenio.webpage import adderrorbox from invenio.bibindex_engine_stemmer import stem from invenio.bibindex_engine_stopwords import is_stopword from invenio.bibrank_citation_searcher import get_cited_by, get_cited_by_weight from invenio.intbitset import intbitset def compare_on_val(first, second): return cmp(second[1], first[1]) def check_term(term, col_size, term_rec, max_occ, min_occ, termlength): """Check if the tem is valid for use term - the term to check col_size - the number of records in database term_rec - the number of records which contains this term max_occ - max frequency of the term allowed min_occ - min frequence of the term allowed termlength - the minimum length of the terms allowed""" try: if is_stopword(term, 1) or (len(term) <= termlength) or ((float(term_rec) / float(col_size)) >= max_occ) or ((float(term_rec) / float(col_size)) <= min_occ): return "" if int(term): return "" except StandardError, e: pass return "true" def create_rnkmethod_cache(): """Create cache with vital information for each rank method.""" global methods bibrank_meths = run_sql("SELECT name from rnkMETHOD") methods = {} global voutput voutput = "" for (rank_method_code,) in bibrank_meths: try: file = CFG_ETCDIR + "/bibrank/" + rank_method_code + ".cfg" config = ConfigParser.ConfigParser() config.readfp(open(file)) except StandardError, e: pass cfg_function = config.get("rank_method", "function") if config.has_section(cfg_function): methods[rank_method_code] = {} methods[rank_method_code]["function"] = cfg_function methods[rank_method_code]["prefix"] = config.get(cfg_function, "relevance_number_output_prologue") methods[rank_method_code]["postfix"] = config.get(cfg_function, "relevance_number_output_epilogue") methods[rank_method_code]["chars_alphanumericseparators"] = r"[1234567890\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~]" else: raise Exception("Error in configuration file: %s" % (CFG_ETCDIR + "/bibrank/" + rank_method_code + ".cfg")) i8n_names = run_sql("""SELECT ln,value from rnkMETHODNAME,rnkMETHOD where id_rnkMETHOD=rnkMETHOD.id and rnkMETHOD.name=%s""", (rank_method_code,)) for (ln, value) in i8n_names: methods[rank_method_code][ln] = value if config.has_option(cfg_function, "table"): methods[rank_method_code]["rnkWORD_table"] = config.get(cfg_function, "table") methods[rank_method_code]["col_size"] = run_sql("SELECT count(*) FROM %sR" % methods[rank_method_code]["rnkWORD_table"][:-1])[0][0] if config.has_option(cfg_function, "stemming") and config.get(cfg_function, "stemming"): try: methods[rank_method_code]["stemmer"] = config.get(cfg_function, "stemming") except Exception,e: pass if config.has_option(cfg_function, "stopword"): methods[rank_method_code]["stopwords"] = config.get(cfg_function, "stopword") if config.has_section("find_similar"): methods[rank_method_code]["max_word_occurence"] = float(config.get("find_similar", "max_word_occurence")) methods[rank_method_code]["min_word_occurence"] = float(config.get("find_similar", "min_word_occurence")) methods[rank_method_code]["min_word_length"] = int(config.get("find_similar", "min_word_length")) methods[rank_method_code]["min_nr_words_docs"] = int(config.get("find_similar", "min_nr_words_docs")) methods[rank_method_code]["max_nr_words_upper"] = 
int(config.get("find_similar", "max_nr_words_upper")) methods[rank_method_code]["max_nr_words_lower"] = int(config.get("find_similar", "max_nr_words_lower")) methods[rank_method_code]["default_min_relevance"] = int(config.get("find_similar", "default_min_relevance")) if config.has_section("combine_method"): i = 1 methods[rank_method_code]["combine_method"] = [] while config.has_option("combine_method", "method%s" % i): methods[rank_method_code]["combine_method"].append(string.split(config.get("combine_method", "method%s" % i), ",")) i += 1 def is_method_valid(colID, rank_method_code): """ Check if RANK_METHOD_CODE method is valid for the collection given. If colID is None, then check for existence regardless of collection. """ if colID is None: return run_sql("SELECT COUNT(*) FROM rnkMETHOD WHERE name=%s", (rank_method_code,))[0][0] enabled_colls = dict(run_sql("SELECT id_collection, score from collection_rnkMETHOD,rnkMETHOD WHERE id_rnkMETHOD=rnkMETHOD.id AND name='%s'" % rank_method_code)) try: colID = int(colID) except TypeError: return 0 if enabled_colls.has_key(colID): return 1 else: while colID: colID = run_sql("SELECT id_dad FROM collection_collection WHERE id_son=%s" % colID) if colID and enabled_colls.has_key(colID[0][0]): return 1 elif colID: colID = colID[0][0] return 0 def get_bibrank_methods(colID, ln=CFG_SITE_LANG): """ Return a list of rank methods enabled for collection colID and the name of them in the language defined by the ln parameter. """ if not globals().has_key('methods'): create_rnkmethod_cache() avail_methods = [] for (rank_method_code, options) in methods.iteritems(): if options.has_key("function") and is_method_valid(colID, rank_method_code): if options.has_key(ln): avail_methods.append((rank_method_code, options[ln])) elif options.has_key(CFG_SITE_LANG): avail_methods.append((rank_method_code, options[CFG_SITE_LANG])) else: avail_methods.append((rank_method_code, rank_method_code)) return avail_methods def rank_records(rank_method_code, rank_limit_relevance, hitset_global, pattern=[], verbose=0): """rank_method_code, e.g. `jif' or `sbr' (word frequency vector model) rank_limit_relevance, e.g. `23' for `nbc' (number of citations) or `0.10' for `vec' hitset, search engine hits; pattern, search engine query or record ID (you check the type) verbose, verbose level output: list of records list of rank values prefix postfix verbose_output""" global voutput voutput = "" configcreated = "" starttime = time.time() afterfind = starttime - time.time() aftermap = starttime - time.time() try: hitset = copy.deepcopy(hitset_global) #we are receiving a global hitset if not globals().has_key('methods'): create_rnkmethod_cache() function = methods[rank_method_code]["function"] #we get 'citation' method correctly here func_object = globals().get(function) if func_object and pattern and pattern[0][0:6] == "recid:" and function == "word_similarity": result = find_similar(rank_method_code, pattern[0][6:], hitset, rank_limit_relevance, verbose) elif rank_method_code == "citation": #we get rank_method_code correctly here. 
#pattern[0] is the search word - not used by find_cit
            p = ""
            if pattern and pattern[0]:
                p = pattern[0][6:]
            result = find_citations(rank_method_code, p, hitset, verbose)
        elif func_object:
            result = func_object(rank_method_code, pattern, hitset, rank_limit_relevance, verbose)
        else:
            result = rank_by_method(rank_method_code, pattern, hitset, rank_limit_relevance, verbose)
    except Exception, e:
        register_exception()
        result = (None, "", adderrorbox("An error occurred when trying to rank the search result "+rank_method_code, ["Unexpected error: %s<br />
" % (e,)]), voutput) afterfind = time.time() - starttime if result[0] and result[1]: #split into two lists for search_engine results_similar_recIDs = map(lambda x: x[0], result[0]) results_similar_relevances = map(lambda x: x[1], result[0]) result = (results_similar_recIDs, results_similar_relevances, result[1], result[2], "%s" % configcreated + result[3]) aftermap = time.time() - starttime; else: result = (None, None, result[1], result[2], result[3]) if verbose > 0: voutput = voutput+"\nElapsed time after finding: "+str(afterfind)+"\nElapsed after mapping: "+str(aftermap) #add stuff from here into voutput from result tmp = result[4]+voutput result = (result[0],result[1],result[2],result[3],tmp) #dbg = string.join(map(str,methods[rank_method_code].items())) #result = (None, "", adderrorbox("Debug ",rank_method_code+" "+dbg),"",voutput); return result def combine_method(rank_method_code, pattern, hitset, rank_limit_relevance,verbose): """combining several methods into one based on methods/percentage in config file""" global voutput result = {} try: for (method, percent) in methods[rank_method_code]["combine_method"]: function = methods[method]["function"] func_object = globals().get(function) percent = int(percent) if func_object: this_result = func_object(method, pattern, hitset, rank_limit_relevance, verbose)[0] else: this_result = rank_by_method(method, pattern, hitset, rank_limit_relevance, verbose)[0] for i in range(0, len(this_result)): (recID, value) = this_result[i] if value > 0: result[recID] = result.get(recID, 0) + int((float(i) / len(this_result)) * float(percent)) result = result.items() result.sort(lambda x, y: cmp(x[1], y[1])) return (result, "(", ")", voutput) except Exception, e: return (None, "Warning: %s method cannot be used for ranking your query." % rank_method_code, "", voutput) def rank_by_method(rank_method_code, lwords, hitset, rank_limit_relevance,verbose): """Ranking of records based on predetermined values. input: rank_method_code - the code of the method, from the name field in rnkMETHOD, used to get predetermined values from rnkMETHODDATA lwords - a list of words from the query hitset - a list of hits for the query found by search_engine rank_limit_relevance - show only records with a rank value above this verbose - verbose value output: reclist - a list of sorted records, with unsorted added to the end: [[23,34], [344,24], [1,01]] prefix - what to show before the rank value postfix - what to show after the rank value voutput - contains extra information, content dependent on verbose value""" global voutput rnkdict = run_sql("SELECT relevance_data FROM rnkMETHODDATA,rnkMETHOD where rnkMETHOD.id=id_rnkMETHOD and rnkMETHOD.name='%s'" % rank_method_code) if not rnkdict: return (None, "Warning: Could not load ranking data for method %s." % rank_method_code, "", voutput) max_recid = 0 res = run_sql("SELECT max(id) FROM bibrec") if res and res[0][0]: max_recid = int(res[0][0]) lwords_hitset = None for j in range(0, len(lwords)): #find which docs to search based on ranges..should be done in search_engine... 
if lwords[j] and lwords[j][:6] == "recid:": if not lwords_hitset: lwords_hitset = intbitset() lword = lwords[j][6:] if string.find(lword, "->") > -1: lword = string.split(lword, "->") if int(lword[0]) >= max_recid or int(lword[1]) >= max_recid + 1: return (None, "Warning: Given record IDs are out of range.", "", voutput) for i in range(int(lword[0]), int(lword[1])): lwords_hitset.add(int(i)) elif lword < max_recid + 1: lwords_hitset.add(int(lword)) else: return (None, "Warning: Given record IDs are out of range.", "", voutput) rnkdict = deserialize_via_marshal(rnkdict[0][0]) if verbose > 0: voutput += "
<br />Running rank method: %s, using rank_by_method function in bibrank_record_sorter<br />" % rank_method_code
        voutput += "Ranking data loaded, size of structure: %s<br />" % len(rnkdict)
    lrecIDs = list(hitset)
    if verbose > 0:
        voutput += "Number of records to rank: %s<br />" % len(lrecIDs)
    reclist = []
    reclist_addend = []
    if not lwords_hitset: #rank all docs, can this be sped up using something else than a for loop?
        for recID in lrecIDs:
            if rnkdict.has_key(recID):
                reclist.append((recID, rnkdict[recID]))
                del rnkdict[recID]
            else:
                reclist_addend.append((recID, 0))
    else: #rank docs in hitset, can this be sped up using something else than a for loop?
        for recID in lwords_hitset:
            if rnkdict.has_key(recID) and recID in hitset:
                reclist.append((recID, rnkdict[recID]))
                del rnkdict[recID]
            elif recID in hitset:
                reclist_addend.append((recID, 0))
    if verbose > 0:
        voutput += "Number of records ranked: %s<br />" % len(reclist)
        voutput += "Number of records not ranked: %s<br />
" % len(reclist_addend) reclist.sort(lambda x, y: cmp(x[1], y[1])) return (reclist_addend + reclist, methods[rank_method_code]["prefix"], methods[rank_method_code]["postfix"], voutput) def find_citations(rank_method_code, recID, hitset, verbose): """Rank by the amount of citations.""" #calculate the cited-by values for all the members of the hitset #returns: ((recordid,weight),prefix,postfix,message) global voutput voutput = "" #If the recID is numeric, return only stuff that cites it. Otherwise return #stuff that cites hitset #try to convert to int recisint = True recidint = 0 try: recidint = int(recID) except: recisint = False ret = [] if recisint: myrecords = get_cited_by(recidint) #this is a simple list ret = get_cited_by_weight(myrecords) else: ret = get_cited_by_weight(hitset) ret.sort(lambda x,y:cmp(x[1],y[1])) #ascending by the second member of the tuples if verbose > 0: voutput = voutput+"\nrecID "+str(recID)+" is int: "+str(recisint)+" hitset "+str(hitset)+"\n"+"find_citations retlist "+str(ret) #voutput = voutput + str(ret) if ret: return (ret,"(", ")", "") else: return ((),"", "", "") def find_similar(rank_method_code, recID, hitset, rank_limit_relevance,verbose): """Finding terms to use for calculating similarity. Terms are taken from the recid given, returns a list of recids's and relevance, input: rank_method_code - the code of the method, from the name field in rnkMETHOD recID - records to use for find similar hitset - a list of hits for the query found by search_engine rank_limit_relevance - show only records with a rank value above this verbose - verbose value output: reclist - a list of sorted records: [[23,34], [344,24], [1,01]] prefix - what to show before the rank value postfix - what to show after the rank value voutput - contains extra information, content dependent on verbose value""" startCreate = time.time() global voutput if verbose > 0: voutput += "
<br />Running rank method: %s, using find_similar/word_frequency in bibrank_record_sorter<br />
" % rank_method_code rank_limit_relevance = methods[rank_method_code]["default_min_relevance"] try: recID = int(recID) except Exception,e : return (None, "Warning: Error in record ID, please check that a number is given.", "", voutput) rec_terms = run_sql("""SELECT termlist FROM %sR WHERE id_bibrec=%%s""" % methods[rank_method_code]["rnkWORD_table"][:-1], (recID,)) if not rec_terms: return (None, "Warning: Requested record does not seem to exist.", "", voutput) rec_terms = deserialize_via_marshal(rec_terms[0][0]) #Get all documents using terms from the selected documents if len(rec_terms) == 0: return (None, "Warning: Record specified has no content indexed for use with this method.", "", voutput) else: terms = "%s" % rec_terms.keys() terms_recs = dict(run_sql("""SELECT term, hitlist FROM %s WHERE term IN (%s)""" % (methods[rank_method_code]["rnkWORD_table"], terms[1:len(terms) - 1]))) tf_values = {} #Calculate all term frequencies for (term, tf) in rec_terms.iteritems(): if len(term) >= methods[rank_method_code]["min_word_length"] and terms_recs.has_key(term) and tf[1] != 0: tf_values[term] = int((1 + math.log(tf[0])) * tf[1]) #calculate term weigth tf_values = tf_values.items() tf_values.sort(lambda x, y: cmp(y[1], x[1])) #sort based on weigth lwords = [] stime = time.time() (recdict, rec_termcount) = ({}, {}) for (t, tf) in tf_values: #t=term, tf=term frequency term_recs = deserialize_via_marshal(terms_recs[t]) if len(tf_values) <= methods[rank_method_code]["max_nr_words_lower"] or (len(term_recs) >= methods[rank_method_code]["min_nr_words_docs"] and (((float(len(term_recs)) / float(methods[rank_method_code]["col_size"])) <= methods[rank_method_code]["max_word_occurence"]) and ((float(len(term_recs)) / float(methods[rank_method_code]["col_size"])) >= methods[rank_method_code]["min_word_occurence"]))): #too complicated...something must be done lwords.append((t, methods[rank_method_code]["rnkWORD_table"])) #list of terms used (recdict, rec_termcount) = calculate_record_relevance_findsimilar((t, round(tf, 4)) , term_recs, hitset, recdict, rec_termcount, verbose, "true") #true tells the function to not calculate all unimportant terms if len(tf_values) > methods[rank_method_code]["max_nr_words_lower"] and (len(lwords) == methods[rank_method_code]["max_nr_words_upper"] or tf < 0): break if len(recdict) == 0 or len(lwords) == 0: return (None, "Could not find any similar documents, possibly because of error in ranking data.", "", voutput) else: #sort if we got something to sort (reclist, hitset) = sort_record_relevance_findsimilar(recdict, rec_termcount, hitset, rank_limit_relevance, verbose) if verbose > 0: voutput += "
<br />Number of terms: %s<br />" % run_sql("SELECT count(id) FROM %s" % methods[rank_method_code]["rnkWORD_table"])[0][0]
            voutput += "Number of terms to use for query: %s<br />" % len(lwords)
            voutput += "Terms: %s<br />" % lwords
            voutput += "Current number of recIDs: %s<br />" % (methods[rank_method_code]["col_size"])
            voutput += "Prepare time: %s<br />" % (str(time.time() - startCreate))
            voutput += "Total time used: %s<br />
" % (str(time.time() - startCreate)) rank_method_stat(rank_method_code, reclist, lwords) return (reclist[:len(reclist)], methods[rank_method_code]["prefix"], methods[rank_method_code]["postfix"], voutput) def word_similarity(rank_method_code, lwords, hitset, rank_limit_relevance, verbose): """Ranking a records containing specified words and returns a sorted list. input: rank_method_code - the code of the method, from the name field in rnkMETHOD lwords - a list of words from the query hitset - a list of hits for the query found by search_engine rank_limit_relevance - show only records with a rank value above this verbose - verbose value output: reclist - a list of sorted records: [[23,34], [344,24], [1,01]] prefix - what to show before the rank value postfix - what to show after the rank value voutput - contains extra information, content dependent on verbose value""" global voutput startCreate = time.time() if verbose > 0: voutput += "
<br />Running rank method: %s, using word_frequency function in bibrank_record_sorter<br />
" % rank_method_code lwords_old = lwords lwords = [] #Check terms, remove non alphanumeric characters. Use both unstemmed and stemmed version of all terms. for i in range(0, len(lwords_old)): term = string.lower(lwords_old[i]) if not methods[rank_method_code]["stopwords"] == "True" or methods[rank_method_code]["stopwords"] and not is_stopword(term, 1): lwords.append((term, methods[rank_method_code]["rnkWORD_table"])) terms = string.split(string.lower(re.sub(methods[rank_method_code]["chars_alphanumericseparators"], ' ', term))) for term in terms: if methods[rank_method_code].has_key("stemmer"): # stem word term = stem(string.replace(term, ' ', ''), methods[rank_method_code]["stemmer"]) if lwords_old[i] != term: #add if stemmed word is different than original word lwords.append((term, methods[rank_method_code]["rnkWORD_table"])) (recdict, rec_termcount, lrecIDs_remove) = ({}, {}, {}) #For each term, if accepted, get a list of the records using the term #calculate then relevance for each term before sorting the list of records for (term, table) in lwords: term_recs = run_sql("""SELECT term, hitlist FROM %s WHERE term=%%s""" % methods[rank_method_code]["rnkWORD_table"], (term,)) if term_recs: #if term exists in database, use for ranking term_recs = deserialize_via_marshal(term_recs[0][1]) (recdict, rec_termcount) = calculate_record_relevance((term, int(term_recs["Gi"][1])) , term_recs, hitset, recdict, rec_termcount, verbose, quick=None) del term_recs if len(recdict) == 0 or (len(lwords) == 1 and lwords[0] == ""): return (None, "Records not ranked. The query is not detailed enough, or not enough records found, for ranking to be possible.", "", voutput) else: #sort if we got something to sort (reclist, hitset) = sort_record_relevance(recdict, rec_termcount, hitset, rank_limit_relevance, verbose) #Add any documents not ranked to the end of the list if hitset: lrecIDs = list(hitset) #using 2-3mb reclist = zip(lrecIDs, [0] * len(lrecIDs)) + reclist #using 6mb if verbose > 0: voutput += "
<br />Current number of recIDs: %s<br />" % (methods[rank_method_code]["col_size"])
        voutput += "Number of terms: %s<br />" % run_sql("SELECT count(id) FROM %s" % methods[rank_method_code]["rnkWORD_table"])[0][0]
        voutput += "Terms: %s<br />" % lwords
        voutput += "Prepare and pre calculate time: %s<br />" % (str(time.time() - startCreate))
        voutput += "Total time used: %s<br />
" % (str(time.time() - startCreate)) rank_method_stat(rank_method_code, reclist, lwords) return (reclist, methods[rank_method_code]["prefix"], methods[rank_method_code]["postfix"], voutput) def calculate_record_relevance(term, invidx, hitset, recdict, rec_termcount, verbose, quick=None): """Calculating the relevance of the documents based on the input, calculates only one word term - (term, query term factor) the term and its importance in the overall search invidx - {recid: tf, Gi: norm value} The Gi value is used as a idf value hitset - a hitset with records that are allowed to be ranked recdict - contains currently ranked records, is returned with new values rec_termcount - {recid: count} the number of terms in this record that matches the query verbose - verbose value quick - if quick=yes only terms with a positive qtf is used, to limit the number of records to sort""" (t, qtf) = term if invidx.has_key("Gi"):#Gi = weigth for this term, created by bibrank_word_indexer Gi = invidx["Gi"][1] del invidx["Gi"] else: #if not existing, bibrank should be run with -R return (recdict, rec_termcount) if not quick or (qtf >= 0 or (qtf < 0 and len(recdict) == 0)): #Only accept records existing in the hitset received from the search engine for (j, tf) in invidx.iteritems(): if j in hitset:#only include docs found by search_engine based on query try: #calculates rank value recdict[j] = recdict.get(j, 0) + int(math.log(tf[0] * Gi * tf[1] * qtf)) except: return (recdict, rec_termcount) rec_termcount[j] = rec_termcount.get(j, 0) + 1 #number of terms from query in document elif quick: #much used term, do not include all records, only use already existing ones for (j, tf) in recdict.iteritems(): #i.e: if doc contains important term, also count unimportant if invidx.has_key(j): tf = invidx[j] recdict[j] = recdict.get(j, 0) + int(math.log(tf[0] * Gi * tf[1] * qtf)) rec_termcount[j] = rec_termcount.get(j, 0) + 1 #number of terms from query in document return (recdict, rec_termcount) def calculate_record_relevance_findsimilar(term, invidx, hitset, recdict, rec_termcount, verbose, quick=None): """Calculating the relevance of the documents based on the input, calculates only one word term - (term, query term factor) the term and its importance in the overall search invidx - {recid: tf, Gi: norm value} The Gi value is used as a idf value hitset - a hitset with records that are allowed to be ranked recdict - contains currently ranked records, is returned with new values rec_termcount - {recid: count} the number of terms in this record that matches the query verbose - verbose value quick - if quick=yes only terms with a positive qtf is used, to limit the number of records to sort""" (t, qtf) = term if invidx.has_key("Gi"): #Gi = weigth for this term, created by bibrank_word_indexer Gi = invidx["Gi"][1] del invidx["Gi"] else: #if not existing, bibrank should be run with -R return (recdict, rec_termcount) if not quick or (qtf >= 0 or (qtf < 0 and len(recdict) == 0)): #Only accept records existing in the hitset received from the search engine for (j, tf) in invidx.iteritems(): if j in hitset: #only include docs found by search_engine based on query #calculate rank value recdict[j] = recdict.get(j, 0) + int((1 + math.log(tf[0])) * Gi * tf[1] * qtf) rec_termcount[j] = rec_termcount.get(j, 0) + 1 #number of terms from query in document elif quick: #much used term, do not include all records, only use already existing ones for (j, tf) in recdict.iteritems(): #i.e: if doc contains important term, also count unimportant if 
invidx.has_key(j): tf = invidx[j] recdict[j] = recdict[j] + int((1 + math.log(tf[0])) * Gi * tf[1] * qtf) rec_termcount[j] = rec_termcount.get(j, 0) + 1 #number of terms from query in document return (recdict, rec_termcount) def sort_record_relevance(recdict, rec_termcount, hitset, rank_limit_relevance, verbose): """Sorts the dictionary and returns records with a relevance higher than the given value. recdict - {recid: value} unsorted rank_limit_relevance - a value > 0 usually verbose - verbose value""" startCreate = time.time() global voutput reclist = [] #remove all ranked documents so that unranked can be added to the end hitset -= recdict.keys() #gives each record a score between 0-100 divideby = max(recdict.values()) for (j, w) in recdict.iteritems(): w = int(w * 100 / divideby) if w >= rank_limit_relevance: reclist.append((j, w)) #sort scores reclist.sort(lambda x, y: cmp(x[1], y[1])) if verbose > 0: voutput += "Number of records sorted: %s
" % len(reclist) voutput += "Sort time: %s
" % (str(time.time() - startCreate)) return (reclist, hitset) def sort_record_relevance_findsimilar(recdict, rec_termcount, hitset, rank_limit_relevance, verbose): """Sorts the dictionary and returns records with a relevance higher than the given value. recdict - {recid: value} unsorted rank_limit_relevance - a value > 0 usually verbose - verbose value""" startCreate = time.time() global voutput reclist = [] #Multiply with the number of terms of the total number of terms in the query existing in the records for j in recdict.keys(): if recdict[j] > 0 and rec_termcount[j] > 1: recdict[j] = math.log((recdict[j] * rec_termcount[j])) else: recdict[j] = 0 hitset -= recdict.keys() #gives each record a score between 0-100 divideby = max(recdict.values()) for (j, w) in recdict.iteritems(): w = int(w * 100 / divideby) if w >= rank_limit_relevance: reclist.append((j, w)) #sort scores reclist.sort(lambda x, y: cmp(x[1], y[1])) if verbose > 0: voutput += "Number of records sorted: %s
" % len(reclist) voutput += "Sort time: %s
" % (str(time.time() - startCreate)) return (reclist, hitset) def rank_method_stat(rank_method_code, reclist, lwords): """Shows some statistics about the searchresult. rank_method_code - name field from rnkMETHOD reclist - a list of sorted and ranked records lwords - the words in the query""" global voutput if len(reclist) > 20: j = 20 else: j = len(reclist) voutput += "
Rank statistics:
" for i in range(1, j + 1): voutput += "%s,Recid:%s,Score:%s
" % (i,reclist[len(reclist) - i][0],reclist[len(reclist) - i][1]) for (term, table) in lwords: term_recs = run_sql("""SELECT hitlist FROM %s WHERE term=%%s""" % table, (term,)) if term_recs: term_recs = deserialize_via_marshal(term_recs[0][0]) if term_recs.has_key(reclist[len(reclist) - i][0]): voutput += "%s-%s / " % (term, term_recs[reclist[len(reclist) - i][0]]) voutput += "
" voutput += "
Score variation:
" count = {} for i in range(0, len(reclist)): count[reclist[i][1]] = count.get(reclist[i][1], 0) + 1 i = 100 while i >= 0: if count.has_key(i): voutput += "%s-%s
" % (i, count[i]) i -= 1 - -try: - import psyco - psyco.bind(find_similar) - psyco.bind(rank_by_method) - psyco.bind(calculate_record_relevance) - psyco.bind(word_similarity) - psyco.bind(sort_record_relevance) -except StandardError, e: - pass - diff --git a/modules/bibrank/lib/bibrank_tag_based_indexer.py b/modules/bibrank/lib/bibrank_tag_based_indexer.py index 57f69b9b8..a17e19168 100644 --- a/modules/bibrank/lib/bibrank_tag_based_indexer.py +++ b/modules/bibrank/lib/bibrank_tag_based_indexer.py @@ -1,478 +1,468 @@ # -*- coding: utf-8 -*- ## Ranking of records using different parameters and methods. ## This file is part of Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2012 CERN. ## ## Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. __revision__ = "$Id$" import os import sys import time import ConfigParser from invenio.config import \ CFG_SITE_LANG, \ CFG_ETCDIR, \ CFG_PREFIX from invenio.search_engine import perform_request_search, HitSet from invenio.bibrank_citation_indexer import get_citation_weight, print_missing, get_cit_dict, insert_into_cit_db from invenio.bibrank_downloads_indexer import * from invenio.dbquery import run_sql, serialize_via_marshal, deserialize_via_marshal from invenio.errorlib import register_exception from invenio.bibtask import task_get_option, write_message, task_sleep_now_if_required from invenio.bibindex_engine import create_range_list options = {} def remove_auto_cites(dic): """Remove auto-cites and dedupe.""" for key in dic.keys(): new_list = dic.fromkeys(dic[key]).keys() try: new_list.remove(key) except ValueError: pass dic[key] = new_list return dic def citation_repair_exec(): """Repair citation ranking method""" ## repair citations for rowname in ["citationdict","reversedict"]: ## get dic dic = get_cit_dict(rowname) ## repair write_message("Repairing %s" % rowname) dic = remove_auto_cites(dic) ## store healthy citation dic insert_into_cit_db(dic, rowname) return def download_weight_filtering_user_repair_exec (): """Repair download weight filtering user ranking method""" write_message("Repairing for this ranking method is not defined. Skipping.") return def download_weight_total_repair_exec(): """Repair download weight total ranking method""" write_message("Repairing for this ranking method is not defined. Skipping.") return def file_similarity_by_times_downloaded_repair_exec(): """Repair file similarity by times downloaded ranking method""" write_message("Repairing for this ranking method is not defined. Skipping.") return def single_tag_rank_method_repair_exec(): """Repair single tag ranking method""" write_message("Repairing for this ranking method is not defined. 
Skipping.") return def citation_exec(rank_method_code, name, config): """Rank method for citation analysis""" #first check if this is a specific task begin_date = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()) if task_get_option("cmd") == "print-missing": num = task_get_option("num") print_missing(num) else: dict = get_citation_weight(rank_method_code, config) if dict: if task_get_option("id") or task_get_option("collection") or \ task_get_option("modified"): # user have asked to citation-index specific records # only, so we should not update citation indexer's # last run time stamp information begin_date = None intoDB(dict, begin_date, rank_method_code) else: write_message("No need to update the indexes for citations.") def download_weight_filtering_user(run): return bibrank_engine(run) def download_weight_total(run): return bibrank_engine(run) def file_similarity_by_times_downloaded(run): return bibrank_engine(run) def download_weight_filtering_user_exec (rank_method_code, name, config): """Ranking by number of downloads per User. Only one full Text Download is taken in account for one specific userIP address""" begin_date = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()) time1 = time.time() dic = fromDB(rank_method_code) last_updated = get_lastupdated(rank_method_code) keys = new_downloads_to_index(last_updated) filter_downloads_per_hour(keys, last_updated) dic = get_download_weight_filtering_user(dic, keys) intoDB(dic, begin_date, rank_method_code) time2 = time.time() return {"time":time2-time1} def download_weight_total_exec(rank_method_code, name, config): """rankink by total number of downloads without check the user ip if users downloads 3 time the same full text document it has to be count as 3 downloads""" begin_date = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()) time1 = time.time() dic = fromDB(rank_method_code) last_updated = get_lastupdated(rank_method_code) keys = new_downloads_to_index(last_updated) filter_downloads_per_hour(keys, last_updated) dic = get_download_weight_total(dic, keys) intoDB(dic, begin_date, rank_method_code) time2 = time.time() return {"time":time2-time1} def file_similarity_by_times_downloaded_exec(rank_method_code, name, config): """update dictionnary {recid:[(recid, nb page similarity), ()..]}""" begin_date = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()) time1 = time.time() dic = fromDB(rank_method_code) last_updated = get_lastupdated(rank_method_code) keys = new_downloads_to_index(last_updated) filter_downloads_per_hour(keys, last_updated) dic = get_file_similarity_by_times_downloaded(dic, keys) intoDB(dic, begin_date, rank_method_code) time2 = time.time() return {"time":time2-time1} def single_tag_rank_method_exec(rank_method_code, name, config): """Creating the rank method data""" begin_date = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()) rnkset = {} rnkset_old = fromDB(rank_method_code) rnkset_new = single_tag_rank(config) rnkset = union_dicts(rnkset_old, rnkset_new) intoDB(rnkset, begin_date, rank_method_code) def single_tag_rank(config): """Connect the given tag with the data from the kb file given""" write_message("Loading knowledgebase file", verbose=9) kb_data = {} records = [] write_message("Reading knowledgebase file: %s" % \ config.get(config.get("rank_method", "function"), "kb_src")) input = open(config.get(config.get("rank_method", "function"), "kb_src"), 'r') data = input.readlines() for line in data: if not line[0:1] == "#": kb_data[string.strip((string.split(string.strip(line), "---"))[0])] = 
(string.split(string.strip(line), "---"))[1] write_message("Number of lines read from knowledgebase file: %s" % len(kb_data)) tag = config.get(config.get("rank_method", "function"), "tag") tags = config.get(config.get("rank_method", "function"), "check_mandatory_tags").split(", ") if tags == ['']: tags = "" records = [] for (recids, recide) in options["recid_range"]: task_sleep_now_if_required(can_stop_too=True) write_message("......Processing records #%s-%s" % (recids, recide)) recs = run_sql("SELECT id_bibrec, value FROM bib%sx, bibrec_bib%sx WHERE tag=%%s AND id_bibxxx=id and id_bibrec >=%%s and id_bibrec<=%%s" % (tag[0:2], tag[0:2]), (tag, recids, recide)) valid = HitSet(trailing_bits=1) valid.discard(0) for key in tags: newset = HitSet() newset += [recid[0] for recid in (run_sql("SELECT id_bibrec FROM bib%sx, bibrec_bib%sx WHERE id_bibxxx=id AND tag=%%s AND id_bibxxx=id and id_bibrec >=%%s and id_bibrec<=%%s" % (tag[0:2], tag[0:2]), (key, recids, recide)))] valid.intersection_update(newset) if tags: recs = filter(lambda x: x[0] in valid, recs) records = records + list(recs) write_message("Number of records found with the necessary tags: %s" % len(records)) records = filter(lambda x: x[0] in options["validset"], records) rnkset = {} for key, value in records: if kb_data.has_key(value): if not rnkset.has_key(key): rnkset[key] = float(kb_data[value]) else: if kb_data.has_key(rnkset[key]) and float(kb_data[value]) > float((rnkset[key])[1]): rnkset[key] = float(kb_data[value]) else: rnkset[key] = 0 write_message("Number of records available in rank method: %s" % len(rnkset)) return rnkset def get_lastupdated(rank_method_code): """Get the last time the rank method was updated""" res = run_sql("SELECT rnkMETHOD.last_updated FROM rnkMETHOD WHERE name=%s", (rank_method_code, )) if res: return res[0][0] else: raise Exception("Is this the first run? 
Please do a complete update.") def intoDB(dict, date, rank_method_code): """Insert the rank method data into the database""" mid = run_sql("SELECT id from rnkMETHOD where name=%s", (rank_method_code, )) del_rank_method_codeDATA(rank_method_code) serdata = serialize_via_marshal(dict); midstr = str(mid[0][0]); run_sql("INSERT INTO rnkMETHODDATA(id_rnkMETHOD, relevance_data) VALUES (%s,%s)", (midstr, serdata,)) if date: run_sql("UPDATE rnkMETHOD SET last_updated=%s WHERE name=%s", (date, rank_method_code)) def fromDB(rank_method_code): """Get the data for a rank method""" id = run_sql("SELECT id from rnkMETHOD where name=%s", (rank_method_code, )) res = run_sql("SELECT relevance_data FROM rnkMETHODDATA WHERE id_rnkMETHOD=%s", (id[0][0], )) if res: return deserialize_via_marshal(res[0][0]) else: return {} def del_rank_method_codeDATA(rank_method_code): """Delete the data for a rank method""" id = run_sql("SELECT id from rnkMETHOD where name=%s", (rank_method_code, )) run_sql("DELETE FROM rnkMETHODDATA WHERE id_rnkMETHOD=%s", (id[0][0], )) def del_recids(rank_method_code, range_rec): """Delete some records from the rank method""" id = run_sql("SELECT id from rnkMETHOD where name=%s", (rank_method_code, )) res = run_sql("SELECT relevance_data FROM rnkMETHODDATA WHERE id_rnkMETHOD=%s", (id[0][0], )) if res: rec_dict = deserialize_via_marshal(res[0][0]) write_message("Old size: %s" % len(rec_dict)) for (recids, recide) in range_rec: for i in range(int(recids), int(recide)): if rec_dict.has_key(i): del rec_dict[i] write_message("New size: %s" % len(rec_dict)) intoDB(rec_dict, begin_date, rank_method_code) else: write_message("Create before deleting!") def union_dicts(dict1, dict2): "Returns union of the two dicts." union_dict = {} for (key, value) in dict1.iteritems(): union_dict[key] = value for (key, value) in dict2.iteritems(): union_dict[key] = value return union_dict def rank_method_code_statistics(rank_method_code): """Print statistics""" method = fromDB(rank_method_code) max = ('', -999999) maxcount = 0 min = ('', 999999) mincount = 0 for (recID, value) in method.iteritems(): if value < min and value > 0: min = value if value > max: max = value for (recID, value) in method.iteritems(): if value == min: mincount += 1 if value == max: maxcount += 1 write_message("Showing statistic for selected method") write_message("Method name: %s" % getName(rank_method_code)) write_message("Short name: %s" % rank_method_code) write_message("Last run: %s" % get_lastupdated(rank_method_code)) write_message("Number of records: %s" % len(method)) write_message("Lowest value: %s - Number of records: %s" % (min, mincount)) write_message("Highest value: %s - Number of records: %s" % (max, maxcount)) write_message("Divided into 10 sets:") for i in range(1, 11): setcount = 0 distinct_values = {} lower = -1.0 + ((float(max + 1) / 10)) * (i - 1) upper = -1.0 + ((float(max + 1) / 10)) * i for (recID, value) in method.iteritems(): if value >= lower and value <= upper: setcount += 1 distinct_values[value] = 1 write_message("Set %s (%s-%s) %s Distinct values: %s" % (i, lower, upper, len(distinct_values), setcount)) def check_method(rank_method_code): write_message("Checking rank method...") if len(fromDB(rank_method_code)) == 0: write_message("Rank method not yet executed, please run it to create the necessary data.") else: if len(add_recIDs_by_date(rank_method_code)) > 0: write_message("Records modified, update recommended") else: write_message("No records modified, update not necessary") def bibrank_engine(run): """Run 
the indexing task. Return 1 in case of success and 0 in case of failure. """ - - try: - import psyco - psyco.bind(single_tag_rank) - psyco.bind(single_tag_rank_method_exec) - psyco.bind(serialize_via_marshal) - psyco.bind(deserialize_via_marshal) - except StandardError, e: - pass - startCreate = time.time() try: options["run"] = [] options["run"].append(run) for rank_method_code in options["run"]: task_sleep_now_if_required(can_stop_too=True) cfg_name = getName(rank_method_code) write_message("Running rank method: %s." % cfg_name) file = CFG_ETCDIR + "/bibrank/" + rank_method_code + ".cfg" config = ConfigParser.ConfigParser() try: config.readfp(open(file)) except StandardError, e: write_message("Cannot find configurationfile: %s" % file, sys.stderr) raise StandardError cfg_short = rank_method_code cfg_function = config.get("rank_method", "function") + "_exec" cfg_repair_function = config.get("rank_method", "function") + "_repair_exec" cfg_name = getName(cfg_short) options["validset"] = get_valid_range(rank_method_code) if task_get_option("collection"): l_of_colls = string.split(task_get_option("collection"), ", ") recIDs = perform_request_search(c=l_of_colls) recIDs_range = [] for recID in recIDs: recIDs_range.append([recID, recID]) options["recid_range"] = recIDs_range elif task_get_option("id"): options["recid_range"] = task_get_option("id") elif task_get_option("modified"): options["recid_range"] = add_recIDs_by_date(rank_method_code, task_get_option("modified")) elif task_get_option("last_updated"): options["recid_range"] = add_recIDs_by_date(rank_method_code) else: write_message("No records specified, updating all", verbose=2) min_id = run_sql("SELECT min(id) from bibrec")[0][0] max_id = run_sql("SELECT max(id) from bibrec")[0][0] options["recid_range"] = [[min_id, max_id]] if task_get_option("quick") == "no": write_message("Recalculate parameter not used, parameter ignored.", verbose=9) if task_get_option("cmd") == "del": del_recids(cfg_short, options["recid_range"]) elif task_get_option("cmd") == "add": func_object = globals().get(cfg_function) func_object(rank_method_code, cfg_name, config) elif task_get_option("cmd") == "stat": rank_method_code_statistics(rank_method_code) elif task_get_option("cmd") == "check": check_method(rank_method_code) elif task_get_option("cmd") == "print-missing": func_object = globals().get(cfg_function) func_object(rank_method_code, cfg_name, config) elif task_get_option("cmd") == "repair": func_object = globals().get(cfg_repair_function) func_object() else: write_message("Invalid command found processing %s" % rank_method_code, sys.stderr) raise StandardError except StandardError, e: write_message("\nException caught: %s" % e, sys.stderr) register_exception() raise StandardError if task_get_option("verbose"): showtime((time.time() - startCreate)) return 1 def get_valid_range(rank_method_code): """Return a range of records""" write_message("Getting records from collections enabled for rank method.", verbose=9) res = run_sql("SELECT collection.name FROM collection, collection_rnkMETHOD, rnkMETHOD WHERE collection.id=id_collection and id_rnkMETHOD=rnkMETHOD.id and rnkMETHOD.name=%s", (rank_method_code, )) l_of_colls = [] for coll in res: l_of_colls.append(coll[0]) if len(l_of_colls) > 0: recIDs = perform_request_search(c=l_of_colls) else: recIDs = [] valid = HitSet() valid += recIDs return valid def add_recIDs_by_date(rank_method_code, dates=""): """Return recID range from records modified between DATES[0] and DATES[1]. 
If DATES is not set, then add records modified since the last run of the ranking method RANK_METHOD_CODE. """ if not dates: try: dates = (get_lastupdated(rank_method_code), '') except Exception: dates = ("0000-00-00 00:00:00", '') if dates[0] is None: dates = ("0000-00-00 00:00:00", '') query = """SELECT b.id FROM bibrec AS b WHERE b.modification_date >= %s""" if dates[1]: query += " and b.modification_date <= %s" query += " ORDER BY b.id ASC""" if dates[1]: res = run_sql(query, (dates[0], dates[1])) else: res = run_sql(query, (dates[0], )) alist = create_range_list([row[0] for row in res]) if not alist: write_message("No new records added since last time method was run") return alist def getName(rank_method_code, ln=CFG_SITE_LANG, type='ln'): """Returns the name of the method if it exists""" try: rnkid = run_sql("SELECT id FROM rnkMETHOD where name=%s", (rank_method_code, )) if rnkid: rnkid = str(rnkid[0][0]) res = run_sql("SELECT value FROM rnkMETHODNAME where type=%s and ln=%s and id_rnkMETHOD=%s", (type, ln, rnkid)) if not res: res = run_sql("SELECT value FROM rnkMETHODNAME WHERE ln=%s and id_rnkMETHOD=%s and type=%s", (CFG_SITE_LANG, rnkid, type)) if not res: return rank_method_code return res[0][0] else: raise Exception except Exception: write_message("Cannot run rank method, either given code for method is wrong, or it has not been added using the webinterface.") raise Exception def single_tag_rank_method(run): return bibrank_engine(run) def showtime(timeused): """Show time used for method""" write_message("Time used: %d second(s)." % timeused, verbose=9) def citation(run): return bibrank_engine(run) diff --git a/modules/bibrank/lib/bibrank_word_indexer.py b/modules/bibrank/lib/bibrank_word_indexer.py index 27c4f65ff..76af88031 100644 --- a/modules/bibrank/lib/bibrank_word_indexer.py +++ b/modules/bibrank/lib/bibrank_word_indexer.py @@ -1,1206 +1,1194 @@ ## This file is part of Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010 CERN. ## ## Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. 
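# Note on the storage used by intoDB()/fromDB() above: each rank method keeps
# its whole {recid: score} dictionary as a single serialized blob in
# rnkMETHODDATA.relevance_data. A minimal, self-contained sketch of that
# round-trip -- assuming, as the marshal and zlib imports in dbquery.py
# suggest, that serialize_via_marshal() is marshal.dumps() followed by zlib
# compression; the 'scores' dict is a made-up example:
#
#   import marshal
#   from zlib import compress, decompress
#
#   def serialize_sketch(obj):
#       # marshal the dict, then compress the resulting byte string
#       return compress(marshal.dumps(obj))
#
#   def deserialize_sketch(blob):
#       return marshal.loads(decompress(blob))
#
#   scores = {1: 100, 2: 87, 3: 15}  # hypothetical {recid: score} map
#   assert deserialize_sketch(serialize_sketch(scores)) == scores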
__revision__ = "$Id$" import sys import time import urllib import math import re import ConfigParser from invenio.config import \ CFG_SITE_LANG, \ CFG_ETCDIR from invenio.search_engine import perform_request_search, strip_accents, wash_index_term from invenio.dbquery import run_sql, DatabaseError, serialize_via_marshal, deserialize_via_marshal from invenio.bibindex_engine_stemmer import is_stemmer_available_for_language, stem from invenio.bibindex_engine_stopwords import is_stopword from invenio.bibindex_engine import beautify_range_list, \ kill_sleepy_mysql_threads, create_range_list from invenio.bibtask import write_message, task_get_option, task_update_progress, \ task_update_status, task_sleep_now_if_required from invenio.intbitset import intbitset from invenio.errorlib import register_exception options = {} # global variable to hold task options ## safety parameters concerning DB thread-multiplication problem: CFG_CHECK_MYSQL_THREADS = 0 # to check or not to check the problem? CFG_MAX_MYSQL_THREADS = 50 # how many threads (connections) we consider as still safe CFG_MYSQL_THREAD_TIMEOUT = 20 # we'll kill threads that were sleeping for more than X seconds ## override urllib's default password-asking behaviour: class MyFancyURLopener(urllib.FancyURLopener): def prompt_user_passwd(self, host, realm): # supply some dummy credentials by default return ("mysuperuser", "mysuperpass") def http_error_401(self, url, fp, errcode, errmsg, headers): # do not bother with protected pages raise IOError, (999, 'unauthorized access') return None #urllib._urlopener = MyFancyURLopener() nb_char_in_line = 50 # for verbose pretty printing chunksize = 1000 # default size of chunks that the records will be treated by base_process_size = 4500 # process base size ## Dictionary merging functions def dict_union(list1, list2): "Returns union of the two dictionaries." union_dict = {} for (e, count) in list1.iteritems(): union_dict[e] = count for (e, count) in list2.iteritems(): if not union_dict.has_key(e): union_dict[e] = count else: union_dict[e] = (union_dict[e][0] + count[0], count[1]) #for (e, count) in list2.iteritems(): # list1[e] = (list1.get(e, (0, 0))[0] + count[0], count[1]) #return list1 return union_dict # tagToFunctions mapping. It offers an indirection level necesary for # indexing fulltext. The default is get_words_from_phrase tagToWordsFunctions = {} def get_words_from_phrase(phrase, weight, lang="", chars_punctuation=r"[\.\,\:\;\?\!\"]", chars_alphanumericseparators=r"[1234567890\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~]", split=str.split): "Returns list of words from phrase 'phrase'." words = {} phrase = strip_accents(phrase) phrase = phrase.lower() #Getting rid of strange characters phrase = re.sub("é", 'e', phrase) phrase = re.sub("è", 'e', phrase) phrase = re.sub("à", 'a', phrase) phrase = re.sub(" ", ' ', phrase) phrase = re.sub("«", ' ', phrase) phrase = re.sub("»", ' ', phrase) phrase = re.sub("ê", ' ', phrase) phrase = re.sub("&", ' ', phrase) if phrase.find(" -1: #Most likely html, remove html code phrase = re.sub("(?s)<[^>]*>|&#?\w+;", ' ', phrase) #removes http links phrase = re.sub("(?s)http://[^( )]*", '', phrase) phrase = re.sub(chars_punctuation, ' ', phrase) #By doing this like below, characters standing alone, like c a b is not added to the inedx, but when they are together with characters like c++ or c$ they are added. 
for word in split(phrase): if options["remove_stopword"] == "True" and not is_stopword(word, 1) and check_term(word, 0): if lang and lang !="none" and options["use_stemming"]: word = stem(word, lang) if not words.has_key(word): words[word] = (0, 0) else: if not words.has_key(word): words[word] = (0, 0) words[word] = (words[word][0] + weight, 0) elif options["remove_stopword"] == "True" and not is_stopword(word, 1): phrase = re.sub(chars_alphanumericseparators, ' ', word) for word_ in split(phrase): if lang and lang !="none" and options["use_stemming"]: word_ = stem(word_, lang) if word_: if not words.has_key(word_): words[word_] = (0,0) words[word_] = (words[word_][0] + weight, 0) return words class WordTable: "A class to hold the words table." def __init__(self, tablename, fields_to_index, separators="[^\s]"): "Creates words table instance." self.tablename = tablename self.recIDs_in_mem = [] self.fields_to_index = fields_to_index self.separators = separators self.value = {} def get_field(self, recID, tag): """Returns list of values of the MARC-21 'tag' fields for the record 'recID'.""" out = [] bibXXx = "bib" + tag[0] + tag[1] + "x" bibrec_bibXXx = "bibrec_" + bibXXx query = """SELECT value FROM %s AS b, %s AS bb WHERE bb.id_bibrec=%s AND bb.id_bibxxx=b.id AND tag LIKE '%s'""" % (bibXXx, bibrec_bibXXx, recID, tag); res = run_sql(query) for row in res: out.append(row[0]) return out def clean(self): "Cleans the words table." self.value={} def put_into_db(self, mode="normal"): """Updates the current words table in the corresponding DB rnkWORD table. Mode 'normal' means normal execution, mode 'emergency' means words index reverting to old state. """ write_message("%s %s wordtable flush started" % (self.tablename,mode)) write_message('...updating %d words into %sR started' % \ (len(self.value), self.tablename[:-1])) task_update_progress("%s flushed %d/%d words" % (self.tablename, 0, len(self.value))) self.recIDs_in_mem = beautify_range_list(self.recIDs_in_mem) if mode == "normal": for group in self.recIDs_in_mem: query = """UPDATE %sR SET type='TEMPORARY' WHERE id_bibrec BETWEEN '%d' AND '%d' AND type='CURRENT'""" % \ (self.tablename[:-1], group[0], group[1]) write_message(query, verbose=9) run_sql(query) nb_words_total = len(self.value) nb_words_report = int(nb_words_total/10) nb_words_done = 0 for word in self.value.keys(): self.put_word_into_db(word, self.value[word]) nb_words_done += 1 if nb_words_report!=0 and ((nb_words_done % nb_words_report) == 0): write_message('......processed %d/%d words' % (nb_words_done, nb_words_total)) task_update_progress("%s flushed %d/%d words" % (self.tablename, nb_words_done, nb_words_total)) write_message('...updating %d words into %s ended' % \ (nb_words_total, self.tablename), verbose=9) #if options["verbose"]: # write_message('...updating reverse table %sR started' % self.tablename[:-1]) if mode == "normal": for group in self.recIDs_in_mem: query = """UPDATE %sR SET type='CURRENT' WHERE id_bibrec BETWEEN '%d' AND '%d' AND type='FUTURE'""" % \ (self.tablename[:-1], group[0], group[1]) write_message(query, verbose=9) run_sql(query) query = """DELETE FROM %sR WHERE id_bibrec BETWEEN '%d' AND '%d' AND type='TEMPORARY'""" % \ (self.tablename[:-1], group[0], group[1]) write_message(query, verbose=9) run_sql(query) write_message('End of updating wordTable into %s' % self.tablename, verbose=9) elif mode == "emergency": write_message("emergency") for group in self.recIDs_in_mem: query = """UPDATE %sR SET type='CURRENT' WHERE id_bibrec BETWEEN '%d' AND '%d' AND 
type='TEMPORARY'""" % \ (self.tablename[:-1], group[0], group[1]) write_message(query, verbose=9) run_sql(query) query = """DELETE FROM %sR WHERE id_bibrec BETWEEN '%d' AND '%d' AND type='FUTURE'""" % \ (self.tablename[:-1], group[0], group[1]) write_message(query, verbose=9) run_sql(query) write_message('End of emergency flushing wordTable into %s' % self.tablename, verbose=9) #if options["verbose"]: # write_message('...updating reverse table %sR ended' % self.tablename[:-1]) self.clean() self.recIDs_in_mem = [] write_message("%s %s wordtable flush ended" % (self.tablename, mode)) task_update_progress("%s flush ended" % (self.tablename)) def load_old_recIDs(self,word): """Load existing hitlist for the word from the database index files.""" query = "SELECT hitlist FROM %s WHERE term=%%s" % self.tablename res = run_sql(query, (word,)) if res: return deserialize_via_marshal(res[0][0]) else: return None def merge_with_old_recIDs(self,word,recIDs, set): """Merge the system numbers stored in memory (hash of recIDs with value[0] > 0 or -1 according to whether to add/delete them) with those stored in the database index and received in set universe of recIDs for the given word. Return 0 in case no change was done to SET, return 1 in case SET was changed. """ set_changed_p = 0 for recID,sign in recIDs.iteritems(): if sign[0] == -1 and set.has_key(recID): # delete recID if existent in set and if marked as to be deleted del set[recID] set_changed_p = 1 elif sign[0] > -1 and not set.has_key(recID): # add recID if not existent in set and if marked as to be added set[recID] = sign set_changed_p = 1 elif sign[0] > -1 and sign[0] != set[recID][0]: set[recID] = sign set_changed_p = 1 return set_changed_p def put_word_into_db(self, word, recIDs, split=str.split): """Flush a single word to the database and delete it from memory""" set = self.load_old_recIDs(word) #write_message("%s %s" % (word, self.value[word])) if set is not None: # merge the word recIDs found in memory: options["modified_words"][word] = 1 if not self.merge_with_old_recIDs(word, recIDs, set): # nothing to update: write_message("......... unchanged hitlist for ``%s''" % word, verbose=9) pass else: # yes there were some new words: write_message("......... updating hitlist for ``%s''" % word, verbose=9) run_sql("UPDATE %s SET hitlist=%%s WHERE term=%%s" % self.tablename, (serialize_via_marshal(set), word)) else: # the word is new, will create new set: write_message("......... inserting hitlist for ``%s''" % word, verbose=9) set = self.value[word] if len(set) > 0: #new word, add to list options["modified_words"][word] = 1 try: run_sql("INSERT INTO %s (term, hitlist) VALUES (%%s, %%s)" % self.tablename, (word, serialize_via_marshal(set))) except Exception, e: ## FIXME: This is for debugging encoding errors register_exception(prefix="Error when putting the term '%s' into db (hitlist=%s): %s\n" % (repr(word), set, e), alert_admin=True) if not set: # never store empty words run_sql("DELETE from %s WHERE term=%%s" % self.tablename, (word,)) del self.value[word] def display(self): "Displays the word table." keys = self.value.keys() keys.sort() for k in keys: write_message("%s: %s" % (k, self.value[k])) def count(self): "Returns the number of words in the table." return len(self.value) def info(self): "Prints some information on the words table." write_message("The words table contains %d words." % self.count()) def lookup_words(self, word=""): "Lookup word from the words table." 
if not word: done = 0 while not done: try: word = raw_input("Enter word: ") done = 1 except (EOFError, KeyboardInterrupt): return if self.value.has_key(word): write_message("The word '%s' is found %d times." \ % (word, len(self.value[word]))) else: write_message("The word '%s' does not exist in the word file."\ % word) def update_last_updated(self, rank_method_code, starting_time=None): """Update last_updated column of the index table in the database. Puts starting time there so that if the task was interrupted for record download, the records will be reindexed next time.""" if starting_time is None: return None write_message("updating last_updated to %s..." % starting_time, verbose=9) return run_sql("UPDATE rnkMETHOD SET last_updated=%s WHERE name=%s", (starting_time, rank_method_code,)) def add_recIDs(self, recIDs): """Fetches records which id in the recIDs arange list and adds them to the wordTable. The recIDs arange list is of the form: [[i1_low,i1_high],[i2_low,i2_high], ..., [iN_low,iN_high]]. """ global chunksize flush_count = 0 records_done = 0 records_to_go = 0 for arange in recIDs: records_to_go = records_to_go + arange[1] - arange[0] + 1 time_started = time.time() # will measure profile time for arange in recIDs: i_low = arange[0] chunksize_count = 0 while i_low <= arange[1]: # calculate chunk group of recIDs and treat it: i_high = min(i_low+task_get_option("flush")-flush_count-1,arange[1]) i_high = min(i_low+chunksize-chunksize_count-1, i_high) try: self.chk_recID_range(i_low, i_high) except StandardError, e: write_message("Exception caught: %s" % e, sys.stderr) register_exception() task_update_status("ERROR") sys.exit(1) write_message("%s adding records #%d-#%d started" % \ (self.tablename, i_low, i_high)) if CFG_CHECK_MYSQL_THREADS: kill_sleepy_mysql_threads() task_update_progress("%s adding recs %d-%d" % (self.tablename, i_low, i_high)) self.del_recID_range(i_low, i_high) just_processed = self.add_recID_range(i_low, i_high) flush_count = flush_count + i_high - i_low + 1 chunksize_count = chunksize_count + i_high - i_low + 1 records_done = records_done + just_processed write_message("%s adding records #%d-#%d ended " % \ (self.tablename, i_low, i_high)) if chunksize_count >= chunksize: chunksize_count = 0 # flush if necessary: if flush_count >= task_get_option("flush"): self.put_into_db() self.clean() write_message("%s backing up" % (self.tablename)) flush_count = 0 self.log_progress(time_started,records_done,records_to_go) # iterate: i_low = i_high + 1 if flush_count > 0: self.put_into_db() self.log_progress(time_started,records_done,records_to_go) def add_recIDs_by_date(self, dates=""): """Add recIDs modified between DATES[0] and DATES[1]. If DATES is not set, then add records modified since the last run of the ranking method. """ if not dates: write_message("Using the last update time for the rank method") query = """SELECT last_updated FROM rnkMETHOD WHERE name='%s' """ % options["current_run"] res = run_sql(query) if not res: return if not res[0][0]: dates = ("0000-00-00",'') else: dates = (res[0][0],'') query = """SELECT b.id FROM bibrec AS b WHERE b.modification_date >= '%s'""" % dates[0] if dates[1]: query += "and b.modification_date <= '%s'" % dates[1] query += " ORDER BY b.id ASC""" res = run_sql(query) alist = create_range_list([row[0] for row in res]) if not alist: write_message( "No new records added. 
%s is up to date" % self.tablename) else: self.add_recIDs(alist) return alist def add_recID_range(self, recID1, recID2): """Add records from RECID1 to RECID2.""" wlist = {} normalize = {} self.recIDs_in_mem.append([recID1,recID2]) # secondly fetch all needed tags: for (tag, weight, lang) in self.fields_to_index: if tag in tagToWordsFunctions.keys(): get_words_function = tagToWordsFunctions[tag] else: get_words_function = get_words_from_phrase bibXXx = "bib" + tag[0] + tag[1] + "x" bibrec_bibXXx = "bibrec_" + bibXXx query = """SELECT bb.id_bibrec,b.value FROM %s AS b, %s AS bb WHERE bb.id_bibrec BETWEEN %d AND %d AND bb.id_bibxxx=b.id AND tag LIKE '%s'""" % (bibXXx, bibrec_bibXXx, recID1, recID2, tag) res = run_sql(query) nb_total_to_read = len(res) verbose_idx = 0 # for verbose pretty printing for row in res: recID, phrase = row if recID in options["validset"]: if not wlist.has_key(recID): wlist[recID] = {} new_words = get_words_function(phrase, weight, lang) # ,self.separators wlist[recID] = dict_union(new_words,wlist[recID]) # were there some words for these recIDs found? if len(wlist) == 0: return 0 recIDs = wlist.keys() for recID in recIDs: # was this record marked as deleted? if "DELETED" in self.get_field(recID, "980__c"): wlist[recID] = {} write_message("... record %d was declared deleted, removing its word list" % recID, verbose=9) write_message("... record %d, termlist: %s" % (recID, wlist[recID]), verbose=9) # put words into reverse index table with FUTURE status: for recID in recIDs: run_sql("INSERT INTO %sR (id_bibrec,termlist,type) VALUES (%%s,%%s,'FUTURE')" % self.tablename[:-1], (recID, serialize_via_marshal(wlist[recID]))) # ... and, for new records, enter the CURRENT status as empty: try: run_sql("INSERT INTO %sR (id_bibrec,termlist,type) VALUES (%%s,%%s,'CURRENT')" % self.tablename[:-1], (recID, serialize_via_marshal([]))) except DatabaseError: # okay, it's an already existing record, no problem pass # put words into memory word list: put = self.put for recID in recIDs: for (w, count) in wlist[recID].iteritems(): put(recID, w, count) return len(recIDs) def log_progress(self, start, done, todo): """Calculate progress and store it. start: start time, done: records processed, todo: total number of records""" time_elapsed = time.time() - start # consistency check if time_elapsed == 0 or done > todo: return time_recs_per_min = done/(time_elapsed/60.0) write_message("%d records took %.1f seconds to complete.(%1.f recs/min)"\ % (done, time_elapsed, time_recs_per_min)) if time_recs_per_min: write_message("Estimated runtime: %.1f minutes" % \ ((todo-done)/time_recs_per_min)) def put(self, recID, word, sign): "Adds/deletes a word to the word list." try: word = wash_index_term(word) if self.value.has_key(word): # the word 'word' exist already: update sign self.value[word][recID] = sign # PROBLEM ? else: self.value[word] = {recID: sign} except: write_message("Error: Cannot put word %s with sign %d for recID %s." % (word, sign, recID)) def del_recIDs(self, recIDs): """Fetches records which id in the recIDs range list and adds them to the wordTable. The recIDs range list is of the form: [[i1_low,i1_high],[i2_low,i2_high], ..., [iN_low,iN_high]]. 
""" count = 0 for range in recIDs: self.del_recID_range(range[0],range[1]) count = count + range[1] - range[0] self.put_into_db() def del_recID_range(self, low, high): """Deletes records with 'recID' system number between low and high from memory words index table.""" write_message("%s fetching existing words for records #%d-#%d started" % \ (self.tablename, low, high), verbose=3) self.recIDs_in_mem.append([low,high]) query = """SELECT id_bibrec,termlist FROM %sR as bb WHERE bb.id_bibrec BETWEEN '%d' AND '%d'""" % (self.tablename[:-1], low, high) recID_rows = run_sql(query) for recID_row in recID_rows: recID = recID_row[0] wlist = deserialize_via_marshal(recID_row[1]) for word in wlist: self.put(recID, word, (-1, 0)) write_message("%s fetching existing words for records #%d-#%d ended" % \ (self.tablename, low, high), verbose=3) def report_on_table_consistency(self): """Check reverse words index tables (e.g. rnkWORD01R) for interesting states such as 'TEMPORARY' state. Prints small report (no of words, no of bad words). """ # find number of words: query = """SELECT COUNT(*) FROM %s""" % (self.tablename) res = run_sql(query, None, 1) if res: nb_words = res[0][0] else: nb_words = 0 # find number of records: query = """SELECT COUNT(DISTINCT(id_bibrec)) FROM %sR""" % (self.tablename[:-1]) res = run_sql(query, None, 1) if res: nb_records = res[0][0] else: nb_records = 0 # report stats: write_message("%s contains %d words from %d records" % (self.tablename, nb_words, nb_records)) # find possible bad states in reverse tables: query = """SELECT COUNT(DISTINCT(id_bibrec)) FROM %sR WHERE type <> 'CURRENT'""" % (self.tablename[:-1]) res = run_sql(query) if res: nb_bad_records = res[0][0] else: nb_bad_records = 999999999 if nb_bad_records: write_message("EMERGENCY: %s needs to repair %d of %d index records" % \ (self.tablename, nb_bad_records, nb_records)) else: write_message("%s is in consistent state" % (self.tablename)) return nb_bad_records def repair(self): """Repair the whole table""" # find possible bad states in reverse tables: query = """SELECT COUNT(DISTINCT(id_bibrec)) FROM %sR WHERE type <> 'CURRENT'""" % (self.tablename[:-1]) res = run_sql(query, None, 1) if res: nb_bad_records = res[0][0] else: nb_bad_records = 0 # find number of records: query = """SELECT COUNT(DISTINCT(id_bibrec)) FROM %sR""" % (self.tablename[:-1]) res = run_sql(query) if res: nb_records = res[0][0] else: nb_records = 0 if nb_bad_records == 0: return query = """SELECT id_bibrec FROM %sR WHERE type <> 'CURRENT' ORDER BY id_bibrec""" \ % (self.tablename[:-1]) res = run_sql(query) recIDs = create_range_list([row[0] for row in res]) flush_count = 0 records_done = 0 records_to_go = 0 for range in recIDs: records_to_go = records_to_go + range[1] - range[0] + 1 time_started = time.time() # will measure profile time for range in recIDs: i_low = range[0] chunksize_count = 0 while i_low <= range[1]: # calculate chunk group of recIDs and treat it: i_high = min(i_low+task_get_option("flush")-flush_count-1,range[1]) i_high = min(i_low+chunksize-chunksize_count-1, i_high) try: self.fix_recID_range(i_low, i_high) except StandardError, e: write_message("Exception caught: %s" % e, sys.stderr) register_exception() task_update_status("ERROR") sys.exit(1) flush_count = flush_count + i_high - i_low + 1 chunksize_count = chunksize_count + i_high - i_low + 1 records_done = records_done + i_high - i_low + 1 if chunksize_count >= chunksize: chunksize_count = 0 # flush if necessary: if flush_count >= task_get_option("flush"): 
self.put_into_db("emergency") self.clean() flush_count = 0 self.log_progress(time_started,records_done,records_to_go) # iterate: i_low = i_high + 1 if flush_count > 0: self.put_into_db("emergency") self.log_progress(time_started,records_done,records_to_go) write_message("%s inconsistencies repaired." % self.tablename) def chk_recID_range(self, low, high): """Check if the reverse index table is in proper state""" ## check db query = """SELECT COUNT(*) FROM %sR WHERE type <> 'CURRENT' AND id_bibrec BETWEEN '%d' AND '%d'""" % (self.tablename[:-1], low, high) res = run_sql(query, None, 1) if res[0][0]==0: write_message("%s for %d-%d is in consistent state"%(self.tablename,low,high)) return # okay, words table is consistent ## inconsistency detected! write_message("EMERGENCY: %s inconsistencies detected..." % self.tablename) write_message("""EMERGENCY: Errors found. You should check consistency of the %s - %sR tables.\nRunning 'bibrank --repair' is recommended.""" \ % (self.tablename, self.tablename[:-1])) raise StandardError def fix_recID_range(self, low, high): """Try to fix reverse index database consistency (e.g. table rnkWORD01R) in the low,high doc-id range. Possible states for a recID follow: CUR TMP FUT: very bad things have happened: warn! CUR TMP : very bad things have happened: warn! CUR FUT: delete FUT (crash before flushing) CUR : database is ok TMP FUT: add TMP to memory and del FUT from memory flush (revert to old state) TMP : very bad things have happened: warn! FUT: very bad things have happended: warn! """ state = {} query = "SELECT id_bibrec,type FROM %sR WHERE id_bibrec BETWEEN '%d' AND '%d'"\ % (self.tablename[:-1], low, high) res = run_sql(query) for row in res: if not state.has_key(row[0]): state[row[0]]=[] state[row[0]].append(row[1]) ok = 1 # will hold info on whether we will be able to repair for recID in state.keys(): if not 'TEMPORARY' in state[recID]: if 'FUTURE' in state[recID]: if 'CURRENT' not in state[recID]: write_message("EMERGENCY: Index record %d is in inconsistent state. Can't repair it" % recID) ok = 0 else: write_message("EMERGENCY: Inconsistency in index record %d detected" % recID) query = """DELETE FROM %sR WHERE id_bibrec='%d'""" % (self.tablename[:-1], recID) run_sql(query) write_message("EMERGENCY: Inconsistency in index record %d repaired." % recID) else: if 'FUTURE' in state[recID] and not 'CURRENT' in state[recID]: self.recIDs_in_mem.append([recID,recID]) # Get the words file query = """SELECT type,termlist FROM %sR WHERE id_bibrec='%d'""" % (self.tablename[:-1], recID) write_message(query, verbose=9) res = run_sql(query) for row in res: wlist = deserialize_via_marshal(row[1]) write_message("Words are %s " % wlist, verbose=9) if row[0] == 'TEMPORARY': sign = 1 else: sign = -1 for word in wlist: self.put(recID, word, wlist[word]) else: write_message("EMERGENCY: %s for %d is in inconsistent state. Couldn't repair it." % (self.tablename, recID)) ok = 0 if not ok: write_message("""EMERGENCY: Unrepairable errors found. You should check consistency of the %s - %sR tables. Deleting affected TEMPORARY and FUTURE entries from these tables is recommended; see the BibIndex Admin Guide. (The repairing procedure is similar for bibrank word indexes.)""" % (self.tablename, self.tablename[:-1])) raise StandardError def word_index(run): """Run the indexing task. The row argument is the BibSched task queue row, containing if, arguments, etc. Return 1 in case of success and 0 in case of failure. 
""" - - ## import optional modules: - try: - import psyco - psyco.bind(get_words_from_phrase) - psyco.bind(WordTable.merge_with_old_recIDs) - psyco.bind(update_rnkWORD) - psyco.bind(check_rnkWORD) - except StandardError,e: - print "Warning: Psyco", e - pass - global languages max_recid = 0 res = run_sql("SELECT max(id) FROM bibrec") if res and res[0][0]: max_recid = int(res[0][0]) options["run"] = [] options["run"].append(run) for rank_method_code in options["run"]: task_sleep_now_if_required(can_stop_too=True) method_starting_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()) write_message("Running rank method: %s" % getName(rank_method_code)) try: file = CFG_ETCDIR + "/bibrank/" + rank_method_code + ".cfg" config = ConfigParser.ConfigParser() config.readfp(open(file)) except StandardError, e: write_message("Cannot find configurationfile: %s" % file, sys.stderr) raise StandardError options["current_run"] = rank_method_code options["modified_words"] = {} options["table"] = config.get(config.get("rank_method", "function"), "table") options["use_stemming"] = config.get(config.get("rank_method","function"),"stemming") options["remove_stopword"] = config.get(config.get("rank_method","function"),"stopword") tags = get_tags(config) #get the tags to include options["validset"] = get_valid_range(rank_method_code) #get the records from the collections the method is enabled for function = config.get("rank_method","function") wordTable = WordTable(options["table"], tags) wordTable.report_on_table_consistency() try: if task_get_option("cmd") == "del": if task_get_option("id"): wordTable.del_recIDs(task_get_option("id")) task_sleep_now_if_required(can_stop_too=True) elif task_get_option("collection"): l_of_colls = task_get_option("collection").split(",") recIDs = perform_request_search(c=l_of_colls) recIDs_range = [] for recID in recIDs: recIDs_range.append([recID,recID]) wordTable.del_recIDs(recIDs_range) task_sleep_now_if_required(can_stop_too=True) else: write_message("Missing IDs of records to delete from index %s.", wordTable.tablename, sys.stderr) raise StandardError elif task_get_option("cmd") == "add": if task_get_option("id"): wordTable.add_recIDs(task_get_option("id")) task_sleep_now_if_required(can_stop_too=True) elif task_get_option("collection"): l_of_colls = task_get_option("collection").split(",") recIDs = perform_request_search(c=l_of_colls) recIDs_range = [] for recID in recIDs: recIDs_range.append([recID,recID]) wordTable.add_recIDs(recIDs_range) task_sleep_now_if_required(can_stop_too=True) elif task_get_option("last_updated"): wordTable.add_recIDs_by_date("") # only update last_updated if run via automatic mode: wordTable.update_last_updated(rank_method_code, method_starting_time) task_sleep_now_if_required(can_stop_too=True) elif task_get_option("modified"): wordTable.add_recIDs_by_date(task_get_option("modified")) task_sleep_now_if_required(can_stop_too=True) else: wordTable.add_recIDs([[0,max_recid]]) task_sleep_now_if_required(can_stop_too=True) elif task_get_option("cmd") == "repair": wordTable.repair() check_rnkWORD(options["table"]) task_sleep_now_if_required(can_stop_too=True) elif task_get_option("cmd") == "check": check_rnkWORD(options["table"]) options["modified_words"] = {} task_sleep_now_if_required(can_stop_too=True) elif task_get_option("cmd") == "stat": rank_method_code_statistics(options["table"]) task_sleep_now_if_required(can_stop_too=True) else: write_message("Invalid command found processing %s" % \ wordTable.tablename, sys.stderr) raise StandardError 
update_rnkWORD(options["table"], options["modified_words"]) task_sleep_now_if_required(can_stop_too=True) except StandardError, e: register_exception(alert_admin=True) write_message("Exception caught: %s" % e, sys.stderr) sys.exit(1) wordTable.report_on_table_consistency() # We are done. State it in the database, close and quit return 1 def get_tags(config): """Get the tags that should be used creating the index and each tag's parameter""" tags = [] function = config.get("rank_method","function") i = 1 shown_error = 0 #try: if 1: while config.has_option(function,"tag%s"% i): tag = config.get(function, "tag%s" % i) tag = tag.split(",") tag[1] = int(tag[1].strip()) tag[2] = tag[2].strip() #check if stemmer for language is available if config.get(function, "stemming") and stem("information", "en") != "inform": if shown_error == 0: write_message("Warning: Stemming not working. Please check it out!") shown_error = 1 elif tag[2] and tag[2] != "none" and config.get(function,"stemming") and not is_stemmer_available_for_language(tag[2]): write_message("Warning: Stemming not available for language '%s'." % tag[2]) tags.append(tag) i += 1 #except Exception: # write_message("Could not read data from configuration file, please check for errors") # raise StandardError return tags def get_valid_range(rank_method_code): """Returns which records are valid for this rank method, according to which collections it is enabled for.""" #if options["verbose"] >=9: # write_message("Getting records from collections enabled for rank method.") #res = run_sql("SELECT collection.name FROM collection,collection_rnkMETHOD,rnkMETHOD WHERE collection.id=id_collection and id_rnkMETHOD=rnkMETHOD.id and rnkMETHOD.name='%s'" % rank_method_code) #l_of_colls = [] #for coll in res: # l_of_colls.append(coll[0]) #if len(l_of_colls) > 0: # recIDs = perform_request_search(c=l_of_colls) #else: # recIDs = [] valid = intbitset(trailing_bits=1) valid.discard(0) #valid.addlist(recIDs) return valid def check_term(term, termlength): """Check if term contains not allowed characters, or for any other reasons for not using this term.""" try: if len(term) <= termlength: return False reg = re.compile(r"[1234567890\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~]") if re.search(reg, term): return False term = str.replace(term, "-", "") term = str.replace(term, ".", "") term = str.replace(term, ",", "") if int(term): return False except StandardError, e: pass return True def check_rnkWORD(table): """Checks for any problems in rnkWORD tables.""" i = 0 errors = {} termslist = run_sql("SELECT term FROM %s" % table) N = run_sql("select max(id_bibrec) from %sR" % table[:-1])[0][0] write_message("Checking integrity of rank values in %s" % table) terms = map(lambda x: x[0], termslist) while i < len(terms): query_params = () for j in range(i, ((i+5000)< len(terms) and (i+5000) or len(terms))): query_params += (terms[j],) terms_docs = run_sql("SELECT term, hitlist FROM %s WHERE term IN (%s)" % (table, (len(query_params)*"%s,")[:-1]), query_params) for (t, hitlist) in terms_docs: term_docs = deserialize_via_marshal(hitlist) if (term_docs.has_key("Gi") and term_docs["Gi"][1] == 0) or not term_docs.has_key("Gi"): write_message("ERROR: Missing value for term: %s (%s) in %s: %s" % (t, repr(t), table, len(term_docs))) errors[t] = 1 i += 5000 write_message("Checking integrity of rank values in %sR" % table[:-1]) i = 0 while i < N: docs_terms = run_sql("SELECT id_bibrec, termlist FROM %sR WHERE id_bibrec>=%s and id_bibrec<=%s" % (table[:-1], i, 
i+5000)) for (j, termlist) in docs_terms: termlist = deserialize_via_marshal(termlist) for (t, tf) in termlist.iteritems(): if tf[1] == 0 and not errors.has_key(t): errors[t] = 1 write_message("ERROR: Gi missing for record %s and term: %s (%s) in %s" % (j,t,repr(t), table)) terms_docs = run_sql("SELECT term, hitlist FROM %s WHERE term=%%s" % table, (t,)) termlist = deserialize_via_marshal(terms_docs[0][1]) i += 5000 if len(errors) == 0: write_message("No direct errors found, but nonconsistent data may exist.") else: write_message("%s errors found during integrity check, repair and rebalancing recommended." % len(errors)) options["modified_words"] = errors def rank_method_code_statistics(table): """Shows some statistics about this rank method.""" maxID = run_sql("select max(id) from %s" % table) maxID = maxID[0][0] terms = {} Gi = {} write_message("Showing statistics of terms in index:") write_message("Important: For the 'Least used terms', the number of terms is shown first, and the number of occurences second.") write_message("Least used terms---Most important terms---Least important terms") i = 0 while i < maxID: terms_docs=run_sql("SELECT term, hitlist FROM %s WHERE id>= %s and id < %s" % (table, i, i + 10000)) for (t, hitlist) in terms_docs: term_docs=deserialize_via_marshal(hitlist) terms[len(term_docs)] = terms.get(len(term_docs), 0) + 1 if term_docs.has_key("Gi"): Gi[t] = term_docs["Gi"] i=i + 10000 terms=terms.items() terms.sort(lambda x, y: cmp(y[1], x[1])) Gi=Gi.items() Gi.sort(lambda x, y: cmp(y[1], x[1])) for i in range(0, 20): write_message("%s/%s---%s---%s" % (terms[i][0],terms[i][1], Gi[i][0],Gi[len(Gi) - i - 1][0])) def update_rnkWORD(table, terms): """Updates rnkWORDF and rnkWORDR with Gi and Nj values. For each term in rnkWORDF, a Gi value for the term is added. And for each term in each document, the Nj value for that document is added. In rnkWORDR, the Gi value for each term in each document is added. For description on how things are computed, look in the hacking docs. 
table - name of forward index to update terms - modified terms""" stime = time.time() Gi = {} Nj = {} N = run_sql("select count(id_bibrec) from %sR" % table[:-1])[0][0] if len(terms) == 0 and task_get_option("quick") == "yes": write_message("No terms to process, ending...") return "" elif task_get_option("quick") == "yes": #not used -R option, fast calculation (not accurate) write_message("Beginning post-processing of %s terms" % len(terms)) #Locating all documents related to the modified/new/deleted terms, if fast update, #only take into account new/modified occurences write_message("Phase 1: Finding records containing modified terms") terms = terms.keys() i = 0 while i < len(terms): terms_docs = get_from_forward_index(terms, i, (i+5000), table) for (t, hitlist) in terms_docs: term_docs = deserialize_via_marshal(hitlist) if term_docs.has_key("Gi"): del term_docs["Gi"] for (j, tf) in term_docs.iteritems(): if (task_get_option("quick") == "yes" and tf[1] == 0) or task_get_option("quick") == "no": Nj[j] = 0 write_message("Phase 1: ......processed %s/%s terms" % ((i+5000>len(terms) and len(terms) or (i+5000)), len(terms))) i += 5000 write_message("Phase 1: Finished finding records containing modified terms") #Find all terms in the records found in last phase write_message("Phase 2: Finding all terms in affected records") records = Nj.keys() i = 0 while i < len(records): docs_terms = get_from_reverse_index(records, i, (i + 5000), table) for (j, termlist) in docs_terms: doc_terms = deserialize_via_marshal(termlist) for (t, tf) in doc_terms.iteritems(): Gi[t] = 0 write_message("Phase 2: ......processed %s/%s records " % ((i+5000>len(records) and len(records) or (i+5000)), len(records))) i += 5000 write_message("Phase 2: Finished finding all terms in affected records") else: #recalculate max_id = run_sql("SELECT MAX(id) FROM %s" % table) max_id = max_id[0][0] write_message("Beginning recalculation of %s terms" % max_id) terms = [] i = 0 while i < max_id: terms_docs = get_from_forward_index_with_id(i, (i+5000), table) for (t, hitlist) in terms_docs: Gi[t] = 0 term_docs = deserialize_via_marshal(hitlist) if term_docs.has_key("Gi"): del term_docs["Gi"] for (j, tf) in term_docs.iteritems(): Nj[j] = 0 write_message("Phase 1: ......processed %s/%s terms" % ((i+5000)>max_id and max_id or (i+5000), max_id)) i += 5000 write_message("Phase 1: Finished finding which records contains which terms") write_message("Phase 2: Jumping over..already done in phase 1 because of -R option") terms = Gi.keys() Gi = {} i = 0 if task_get_option("quick") == "no": #Calculating Fi and Gi value for each term write_message("Phase 3: Calculating importance of all affected terms") while i < len(terms): terms_docs = get_from_forward_index(terms, i, (i+5000), table) for (t, hitlist) in terms_docs: term_docs = deserialize_via_marshal(hitlist) if term_docs.has_key("Gi"): del term_docs["Gi"] Fi = 0 Gi[t] = 1 for (j, tf) in term_docs.iteritems(): Fi += tf[0] for (j, tf) in term_docs.iteritems(): if tf[0] != Fi: Gi[t] = Gi[t] + ((float(tf[0]) / Fi) * math.log(float(tf[0]) / Fi) / math.log(2)) / math.log(N) write_message("Phase 3: ......processed %s/%s terms" % ((i+5000>len(terms) and len(terms) or (i+5000)), len(terms))) i += 5000 write_message("Phase 3: Finished calculating importance of all affected terms") else: #Using existing Gi value instead of calculating a new one. Missing some accurancy. 
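# Gi is an entropy-style global weight over the term's frequency
# distribution: with Fi the term's total number of occurrences and
# p_j = tf_j / Fi its share in record j, every record contributes
# p_j * log(p_j) (a negative quantity), scaled by the collection size N.
# A term spread evenly over many records is thus pushed towards 0, while a
# term confined to a single record keeps Gi = 1 (its tf equals Fi, so the
# loop adds nothing). This is why Gi can stand in for an idf value at
# search time, as noted in the ranking code above.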
write_message("Phase 3: Getting approximate importance of all affected terms") while i < len(terms): terms_docs = get_from_forward_index(terms, i, (i+5000), table) for (t, hitlist) in terms_docs: term_docs = deserialize_via_marshal(hitlist) if term_docs.has_key("Gi"): Gi[t] = term_docs["Gi"][1] elif len(term_docs) == 1: Gi[t] = 1 else: Fi = 0 Gi[t] = 1 for (j, tf) in term_docs.iteritems(): Fi += tf[0] for (j, tf) in term_docs.iteritems(): if tf[0] != Fi: Gi[t] = Gi[t] + ((float(tf[0]) / Fi) * math.log(float(tf[0]) / Fi) / math.log(2)) / math.log(N) write_message("Phase 3: ......processed %s/%s terms" % ((i+5000>len(terms) and len(terms) or (i+5000)), len(terms))) i += 5000 write_message("Phase 3: Finished getting approximate importance of all affected terms") write_message("Phase 4: Calculating normalization value for all affected records and updating %sR" % table[:-1]) records = Nj.keys() i = 0 while i < len(records): #Calculating the normalization value for each document, and adding the Gi value to each term in each document. docs_terms = get_from_reverse_index(records, i, (i + 5000), table) for (j, termlist) in docs_terms: doc_terms = deserialize_via_marshal(termlist) try: for (t, tf) in doc_terms.iteritems(): if Gi.has_key(t): Nj[j] = Nj.get(j, 0) + math.pow(Gi[t] * (1 + math.log(tf[0])), 2) Git = int(math.floor(Gi[t]*100)) if Git >= 0: Git += 1 doc_terms[t] = (tf[0], Git) else: Nj[j] = Nj.get(j, 0) + math.pow(tf[1] * (1 + math.log(tf[0])), 2) Nj[j] = 1.0 / math.sqrt(Nj[j]) Nj[j] = int(Nj[j] * 100) if Nj[j] >= 0: Nj[j] += 1 run_sql("UPDATE %sR SET termlist=%%s WHERE id_bibrec=%%s" % table[:-1], (serialize_via_marshal(doc_terms), j)) except (ZeroDivisionError, OverflowError), e: ## This is to try to isolate division by zero errors. register_exception(prefix="Error when analysing the record %s (%s): %s\n" % (j, repr(docs_terms), e), alert_admin=True) write_message("Phase 4: ......processed %s/%s records" % ((i+5000>len(records) and len(records) or (i+5000)), len(records))) i += 5000 write_message("Phase 4: Finished calculating normalization value for all affected records and updating %sR" % table[:-1]) write_message("Phase 5: Updating %s with new normalization values" % table) i = 0 terms = Gi.keys() while i < len(terms): #Adding the Gi value to each term, and adding the normalization value to each term in each document. 
        terms_docs = get_from_forward_index(terms, i, (i+5000), table)
        for (t, hitlist) in terms_docs:
            try:
                term_docs = deserialize_via_marshal(hitlist)
                if term_docs.has_key("Gi"):
                    del term_docs["Gi"]
                for (j, tf) in term_docs.iteritems():
                    if Nj.has_key(j):
                        term_docs[j] = (tf[0], Nj[j])
                Git = int(math.floor(Gi[t]*100))
                if Git >= 0:
                    Git += 1
                term_docs["Gi"] = (0, Git)
                run_sql("UPDATE %s SET hitlist=%%s WHERE term=%%s" % table,
                        (serialize_via_marshal(term_docs), t))
            except (ZeroDivisionError, OverflowError), e:
                register_exception(prefix="Error when analysing the term %s (%s): %s\n" % (t, repr(terms_docs), e), alert_admin=True)
        write_message("Phase 5: ......processed %s/%s terms" % ((i+5000>len(terms) and len(terms) or (i+5000)), len(terms)))
        i += 5000
    write_message("Phase 5: Finished updating %s with new normalization values" % table)
    write_message("Time used for post-processing: %.1fmin" % ((time.time() - stime) / 60))
    write_message("Finished post-processing")

def get_from_forward_index(terms, start, stop, table):
    terms_docs = ()
    for j in range(start, (stop < len(terms) and stop or len(terms))):
        terms_docs += run_sql("SELECT term, hitlist FROM %s WHERE term=%%s" % table,
                              (terms[j],))
    return terms_docs

def get_from_forward_index_with_id(start, stop, table):
    terms_docs = run_sql("SELECT term, hitlist FROM %s WHERE id BETWEEN %s AND %s" % (table, start, stop))
    return terms_docs

def get_from_reverse_index(records, start, stop, table):
    current_recs = "%s" % records[start:stop]
    current_recs = current_recs[1:-1]
    docs_terms = run_sql("SELECT id_bibrec, termlist FROM %sR WHERE id_bibrec IN (%s)" % (table[:-1], current_recs))
    return docs_terms

#def test_word_separators(phrase="hep-th/0101001"):
    #"""Tests word separating policy on various input."""
    #print "%s:" % phrase
    #gwfp = get_words_from_phrase(phrase)
    #for (word, count) in gwfp.iteritems():
        #print "\t-> %s - %s" % (word, count)

def getName(methname, ln=CFG_SITE_LANG, type='ln'):
    """Returns the name of the rank method, either in default language or
    given language.
    methname = short name of the method
    ln - the language to get the name in
    type - which name "type" to get."""
    try:
        rnkid = run_sql("SELECT id FROM rnkMETHOD where name='%s'" % methname)
        if rnkid:
            rnkid = str(rnkid[0][0])
            res = run_sql("SELECT value FROM rnkMETHODNAME where type='%s' and ln='%s' and id_rnkMETHOD=%s" % (type, ln, rnkid))
            if not res:
                res = run_sql("SELECT value FROM rnkMETHODNAME WHERE ln='%s' and id_rnkMETHOD=%s and type='%s'" % (CFG_SITE_LANG, rnkid, type))
            if not res:
                return methname
            return res[0][0]
        else:
            raise Exception
    except Exception, e:
        write_message("Cannot run rank method: either the given code for the method is wrong, or it has not been added using the web interface.")
        raise Exception

def word_similarity(run):
    """Call correct method"""
    return word_index(run)
diff --git a/modules/miscutil/lib/dbquery.py b/modules/miscutil/lib/dbquery.py
index 24581ffb7..eeed53256 100644
--- a/modules/miscutil/lib/dbquery.py
+++ b/modules/miscutil/lib/dbquery.py
@@ -1,359 +1,352 @@
## This file is part of Invenio.
## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.

"""
Invenio utilities to run SQL queries.

The main API functions are:
    - run_sql()
    - run_sql_many()
but see the others as well.
"""

__revision__ = "$Id$"

# dbquery clients can import these from here:
# pylint: disable=W0611
from MySQLdb import Warning, Error, InterfaceError, DataError, \
                    DatabaseError, OperationalError, IntegrityError, \
                    InternalError, NotSupportedError, \
                    ProgrammingError
import string
import time
import marshal
import re
from zlib import compress, decompress
from thread import get_ident
from invenio.config import CFG_ACCESS_CONTROL_LEVEL_SITE, \
     CFG_MISCUTIL_SQL_USE_SQLALCHEMY, \
     CFG_MISCUTIL_SQL_RUN_SQL_MANY_LIMIT

if CFG_MISCUTIL_SQL_USE_SQLALCHEMY:
    try:
        import sqlalchemy.pool as pool
        import MySQLdb as mysqldb
        mysqldb = pool.manage(mysqldb, use_threadlocal=True)
        connect = mysqldb.connect
    except ImportError:
        CFG_MISCUTIL_SQL_USE_SQLALCHEMY = False
        from MySQLdb import connect
else:
    from MySQLdb import connect

## DB config variables.  These variables are to be set in
## invenio-local.conf by admins and then replaced in situ in this file
## by calling "inveniocfg --update-dbexec".
## Note that they are defined here and not in config.py in order to
## prevent them from being exported accidentally elsewhere, as no-one
## should know DB credentials but this file.
## FIXME: this is more of a blast-from-the-past that should be fixed
## both here and in inveniocfg when the time permits.
CFG_DATABASE_HOST = 'localhost'
CFG_DATABASE_PORT = '3306'
CFG_DATABASE_NAME = 'invenio'
CFG_DATABASE_USER = 'invenio'
CFG_DATABASE_PASS = 'my123p$ss'

_DB_CONN = {}

def _db_login(relogin = 0):
    """Login to the database."""

    ## Note: we are using "use_unicode=False", because we want to
    ## receive strings from MySQL as Python UTF-8 binary string
    ## objects, not as Python Unicode string objects, as of yet.

    ## Note: "charset='utf8'" is needed for recent MySQLdb versions
    ## (such as 1.2.1_p2 and above).  For older MySQLdb versions such
    ## as 1.2.0, an explicit "init_command='SET NAMES utf8'" parameter
    ## would constitute an equivalent.  But we are not bothering with
    ## older MySQLdb versions here, since we are recommending to
    ## upgrade to more recent versions anyway.
    if CFG_MISCUTIL_SQL_USE_SQLALCHEMY:
        return connect(host=CFG_DATABASE_HOST, port=int(CFG_DATABASE_PORT),
                       db=CFG_DATABASE_NAME, user=CFG_DATABASE_USER,
                       passwd=CFG_DATABASE_PASS, use_unicode=False,
                       charset='utf8')
    else:
        thread_ident = get_ident()
    if relogin:
        _DB_CONN[thread_ident] = connect(host=CFG_DATABASE_HOST,
                                         port=int(CFG_DATABASE_PORT),
                                         db=CFG_DATABASE_NAME,
                                         user=CFG_DATABASE_USER,
                                         passwd=CFG_DATABASE_PASS,
                                         use_unicode=False, charset='utf8')
        return _DB_CONN[thread_ident]
    else:
        if _DB_CONN.has_key(thread_ident):
            return _DB_CONN[thread_ident]
        else:
            _DB_CONN[thread_ident] = connect(host=CFG_DATABASE_HOST,
                                             port=int(CFG_DATABASE_PORT),
                                             db=CFG_DATABASE_NAME,
                                             user=CFG_DATABASE_USER,
                                             passwd=CFG_DATABASE_PASS,
                                             use_unicode=False, charset='utf8')
            return _DB_CONN[thread_ident]

def _db_logout():
    """Close a connection."""
    try:
        del _DB_CONN[get_ident()]
    except KeyError:
        pass

def run_sql(sql, param=None, n=0, with_desc=0):
    """Run SQL on the server with PARAM and return result.

    @param param: tuple of string params to insert in the query
    (see notes below)
    @param n: number of tuples in result (0 for unbounded)
    @param with_desc: if True, will return a DB API 7-tuple describing
    columns in query.

    @return: If SELECT, SHOW, DESCRIBE statements, return tuples of data,
    followed by description if parameter with_desc is provided.
    If INSERT, return last row id.
    Otherwise return SQL result as provided by database.

    @note: When the site is closed for maintenance (as governed by the
    config variable CFG_ACCESS_CONTROL_LEVEL_SITE), do not attempt to run
    any SQL queries but return empty list immediately.  Useful to be
    able to have the website up while the MySQL database is down for
    maintenance, hot copies, table repairs, etc.

    @note: In case of problems, exceptions are returned according to
    the Python DB API 2.0.  The client code can import them from this
    file and catch them.
    """

    if CFG_ACCESS_CONTROL_LEVEL_SITE == 3:
        # do not connect to the database as the site is closed for maintenance:
        return []

    ### log_sql_query(sql, param) ### UNCOMMENT ONLY IF you REALLY want to log all queries

    if param:
        param = tuple(param)

    try:
        db = _db_login()
        cur = db.cursor()
        rc = cur.execute(sql, param)
    except OperationalError: # unexpected disconnect, bad malloc error, etc
        # FIXME: now reconnect is always forced, we may perhaps want to ping() first?
        try:
            db = _db_login(relogin=1)
            cur = db.cursor()
            rc = cur.execute(sql, param)
        except OperationalError: # again an unexpected disconnect, bad malloc error, etc
            raise

    if string.upper(string.split(sql)[0]) in ("SELECT", "SHOW", "DESC", "DESCRIBE"):
        if n:
            recset = cur.fetchmany(n)
        else:
            recset = cur.fetchall()
        if with_desc:
            return recset, cur.description
        else:
            return recset
    else:
        if string.upper(string.split(sql)[0]) == "INSERT":
            rc = cur.lastrowid
        return rc

def run_sql_many(query, params, limit=CFG_MISCUTIL_SQL_RUN_SQL_MANY_LIMIT):
    """Run SQL on the server with PARAM.
    This method does executemany and is therefore more efficient than
    execute, but it makes sense only with queries that affect the state
    of the database (INSERT, UPDATE).
    That is why the results just count the number of affected rows.

    @param params: tuple of tuple of string params to insert in the query

    @param limit: query will be executed in parts when number of
         parameters is greater than limit (each iteration runs at most
         `limit' parameters)

    @return: SQL result as provided by database
    """
    i = 0
    r = None
    while i < len(params):
        ## make partial query safely (mimicking procedure from run_sql())
        try:
            db = _db_login()
            cur = db.cursor()
            rc = cur.executemany(query, params[i:i+limit])
        except OperationalError:
            try:
                db = _db_login(relogin=1)
                cur = db.cursor()
                rc = cur.executemany(query, params[i:i+limit])
            except OperationalError:
                raise
        ## collect its result:
        if r is None:
            r = rc
        else:
            r += rc
        i += limit
    return r

def blob_to_string(ablob):
    """Return string representation of ABLOB.
    Useful to treat MySQL BLOBs in the same way for both recent and
    old MySQLdb versions.
    """
    if ablob:
        if type(ablob) is str:
            # BLOB is already a string in MySQLdb 0.9.2
            return ablob
        else:
            # BLOB is array.array in MySQLdb 1.0.0 and later
            return ablob.tostring()
    else:
        return ablob

def log_sql_query(sql, param=None):
    """Log SQL query into prefix/var/log/dbquery.log log file.
    In order to enable logging of all SQL queries, please uncomment
    one line in run_sql() above.  Useful for fine-level debugging only!
    """
    from invenio.config import CFG_LOGDIR
    from invenio.dateutils import convert_datestruct_to_datetext
    from invenio.textutils import indent_text
    log_path = CFG_LOGDIR + '/dbquery.log'
    date_of_log = convert_datestruct_to_datetext(time.localtime())
    message = date_of_log + '-->\n'
    message += indent_text('Query:\n' + indent_text(str(sql), 2, wrap=True), 2)
    message += indent_text('Params:\n' + indent_text(str(param), 2, wrap=True), 2)
    message += '-----------------------------\n\n'
    try:
        log_file = open(log_path, 'a+')
        log_file.writelines(message)
        log_file.close()
    except:
        pass

def get_table_update_time(tablename):
    """Return update time of TABLENAME.  TABLENAME can contain
    wildcard `%' in which case we return the maximum update time
    value.
    """
    # Note: in order to work with all of MySQL 4.0, 4.1, 5.0, this
    # function uses SHOW TABLE STATUS technique with a dirty column
    # position lookup to return the correct value.  (Making use of
    # Index_Length column that is either of type long (when there are
    # some indexes defined) or of type None (when there are no indexes
    # defined, e.g. table is empty).  When we shall use solely
    # MySQL-5.0, we can employ a much cleaner technique of using
    # SELECT UPDATE_TIME FROM INFORMATION_SCHEMA.TABLES WHERE
    # table_name='collection'.
    res = run_sql("SHOW TABLE STATUS LIKE %s", (tablename, ))
    update_times = [] # store all update times
    for row in res:
        if type(row[10]) is long or \
           row[10] is None:
            # MySQL-4.1 and 5.0 have creation_time in 11th position,
            # so return next column:
            update_times.append(str(row[12]))
        else:
            # MySQL-4.0 has creation_time in 10th position, which is
            # of type datetime.datetime or str (depending on the
            # version of MySQLdb), so return next column:
            update_times.append(str(row[11]))
    return max(update_times)

def get_table_status_info(tablename):
    """Return table status information on TABLENAME.  Returned is a
    dict with keys like Name, Rows, Data_length, Max_data_length,
    etc.  If TABLENAME does not exist, return empty dict.
""" # Note: again a hack so that it works on all MySQL 4.0, 4.1, 5.0 res = run_sql("SHOW TABLE STATUS LIKE %s", (tablename, )) table_status_info = {} # store all update times for row in res: if type(row[10]) is long or \ row[10] is None: # MySQL-4.1 and 5.0 have creation time in 11th position: table_status_info['Name'] = row[0] table_status_info['Rows'] = row[4] table_status_info['Data_length'] = row[6] table_status_info['Max_data_length'] = row[8] table_status_info['Create_time'] = row[11] table_status_info['Update_time'] = row[12] else: # MySQL-4.0 has creation_time in 10th position, which is # of type datetime.datetime or str (depending on the # version of MySQLdb): table_status_info['Name'] = row[0] table_status_info['Rows'] = row[3] table_status_info['Data_length'] = row[5] table_status_info['Max_data_length'] = row[7] table_status_info['Create_time'] = row[10] table_status_info['Update_time'] = row[11] return table_status_info def serialize_via_marshal(obj): """Serialize Python object via marshal into a compressed string.""" return compress(marshal.dumps(obj)) def deserialize_via_marshal(astring): """Decompress and deserialize string into a Python object via marshal.""" return marshal.loads(decompress(astring)) -try: - import psyco - psyco.bind(serialize_via_marshal) - psyco.bind(deserialize_via_marshal) -except StandardError, e: - pass - def wash_table_column_name(colname): """ Evaluate table-column name to see if it is clean. This function accepts only names containing [a-zA-Z0-9_]. @param colname: The string to be checked @type colname: str @return: colname if test passed @rtype: str @raise Exception: Raises an exception if colname is invalid. """ if re.search('[^\w]', colname): raise Exception('The table column %s is not valid.' % repr(colname)) return colname def real_escape_string(unescaped_string): """ Escapes special characters in the unescaped string for use in a DB query. @param unescaped_string: The string to be escaped @type unescaped_string: str @return: Returns the escaped string @rtype: str """ connection_object = _db_login() escaped_string = connection_object.escape_string(unescaped_string) return escaped_string diff --git a/modules/webhelp/web/hacking/coding-style.webdoc b/modules/webhelp/web/hacking/coding-style.webdoc index 6361818a0..c1eebfb87 100644 --- a/modules/webhelp/web/hacking/coding-style.webdoc +++ b/modules/webhelp/web/hacking/coding-style.webdoc @@ -1,229 +1,228 @@ ## -*- mode: html; coding: utf-8; -*- ## This file is part of Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010 CERN. ## ## Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.

A brief description of things we strive for, more or less unsuccessfully.

1. Packaging

We use the classical GNU Autoconf/Automake approach; for a tutorial, see e.g. Learning the GNU development tools or the AutoBook.

2. Modules

Invenio started as a set of pretty independent modules developed by independent people with independent styles. This was made even more pronounced by the original use of many different languages (e.g. Python, PHP, Perl). Now the Invenio code base is striving to use Python everywhere, except in speed-critical parts, where a compiled language such as Common Lisp may come to the rescue in the near future.

When modifying an existing module, we propose to strictly continue using whatever coding style the module was originally written in. When writing new modules, we propose to stick to the below-mentioned standards.

Code integration across modules is happening, but slowly. Therefore, don't be surprised to find that there is a lot of room for refactoring.

3. Python

We aim at following recommendations from PEP 8, although the existing code surely does not fulfil them here and there. The code indentation is done via spaces only; please do not use tabs. One indentation level counts as four spaces. Emacs users can look into our Emacs Tips wiki page for inspiration.

All the Python code should be extensively documented via docstrings, so you can always run pydoc file.py to peruse the file's documentation in one simple go. We follow the epytext docstring markup, from which epydoc generates nice HTML source code documentation.
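
For example, a minimal docstring in the epytext markup could look like this (the function and its parameters are invented for illustration):

    def count_records_in_collection(collection_name):
        """Return the number of records belonging to COLLECTION_NAME.

        @param collection_name: the name of the collection to inspect
        @type collection_name: str
        @return: the number of records found (0 if the collection is empty)
        @rtype: int
        """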

Do not forget to run pylint on your code to check for errors like uninitialized variables and to improve its quality and conformance to the coding standard. If you develop in Emacs, run M-x pylint RET on your buffers frequently. Read and implement pylint suggestions. (Note that using lambda and friends may lead to false pylint warnings. You can switch them off by putting block comments of the form ``# pylint: disable=C0301''.)
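
For instance, a deliberately long line can be excluded from the line-length check like this (the symbol name and URL are made up for the example):

    # pylint: disable=C0301
    CFG_EXAMPLE_HELP_URL = "http://example.org/some/very/long/documentation/url/that/cannot/reasonably/be/wrapped"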

Do not forget to run pychecker on your code either. It is another source code checker that catches some situations better and some situations worse than pylint. If you develop in Emacs, run C-c C-w
-(M-x py-pychecker-run RET) on your buffers frequently. (Note that
-using psyco on classes may lead to false pychecker warnings.)
+(M-x py-pychecker-run RET) on your buffers frequently.

You can check the kwalitee of your code by running ``python modules/miscutil/lib/kwalitee.py --check-all *.py'' on your files. This will run some basic error checking, warning checking, and indentation checking, and also check compliance with PEP 8. You can also check the code kwalitee stats across all the modules by running ``make kwalitee-check'' in the main source directory.

Do not hardcode magic constants in your code. Every magic string or number should be put into the accompanying file_config.py, with the symbol name beginning with cfg_modulename_*.
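
A hypothetical sketch of this convention (the module and symbol names are invented for the example):

    ## in bibfoo_file_config.py:
    cfg_bibfoo_max_hits = 100
    cfg_bibfoo_welcome_message = "Welcome to BibFoo!"

    ## in the bibfoo code, instead of hardcoding 100:
    from bibfoo_file_config import cfg_bibfoo_max_hits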

Clearly separate interfaces from implementation. Document your interfaces. Do not expose to other modules anything that does not have to be exposed. Apply the principle of least information.

Create as few new library files as possible. Do not create many nested files in nested modules; rather, put all the lib files in one directory, using names such as bibindex_foo and bibindex_bar.

Use the imperative/functional paradigm rather than OO. If you do use OO, then stick to as simple a class hierarchy as possible. Recall that method calls and exception handling in Python are quite expensive.

Prefer the good old foo_bar naming convention for symbols (both variable and function names) to the fooBar CaMelCaSe convention. (Except for class names, where UppercaseSymbolNames are to be used.)
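
To illustrate the convention (all names are invented):

    def create_user_basket(user_id, basket_name):   # good: foo_bar style
        pass

    class BasketCreator:                            # good: UppercaseSymbolNames for classes
        pass

    def createUserBasket(userId, basketName):       # avoid: CaMelCaSe for functions
        pass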

Pay special attention to naming your symbols descriptively. Your code is going to be read and worked with by others, and its symbols should be self-explanatory without any comments and without studying other parts of the code. For example, use proper English words, not abbreviations that can be misspelled in many ways; use words that go in pairs (e.g. create/destroy, start/stop; never create/stop); use self-explanatory symbol names (e.g. list_of_file_extensions rather than list2); never misname symbols (e.g. score_list should hold the list of scores and nothing else -- if in the course of development you change the semantics of what the symbol holds, then change the symbol name too). Do not be afraid to use long descriptive names; good editors such as Emacs can tab-complete symbols for you.

When hacking module A, pay close attention to following the existing coding conventions in A, even if they are legacy-weird and even if we use a different technique elsewhere. (Unless the whole module A is going to be refactored, of course.)

Speed-critical parts should be profiled with pyprof or our built-in web profiler (&profile=t).
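
For a quick start, the standard profile/cProfile module can also be used from within the code; this is only an illustrative sketch (the profiled call is invented), and the pyprof invocation mentioned above may differ:

    import cProfile  # Python 2.5+; use the "profile" module on Python 2.4

    # Run a hypothetical speed-critical call under the profiler and
    # print the hot spots sorted by cumulative time:
    cProfile.run('rank_records_by_word_similarity(recids)', sort='cumulative')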

The code should be well tested before being committed. Testing is an integral part of the development process. Test along as you program. The testing process should be automated via our unit test and regression test suite infrastructures. Please read the test suite strategy to know more.
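
As a minimal illustration, here is what an automated unit test for the wash_table_column_name() helper from dbquery.py (shown earlier in this patch) could look like, using the standard unittest module; real tests should be hooked into the test suite infrastructure mentioned above:

    import unittest

    from invenio.dbquery import wash_table_column_name

    class WashTableColumnNameTest(unittest.TestCase):
        """Check the washing of table column names."""

        def test_accepts_clean_name(self):
            # names made only of [a-zA-Z0-9_] pass through unchanged:
            self.assertEqual(wash_table_column_name("id_bibrec"), "id_bibrec")

        def test_rejects_dirty_name(self):
            # anything else raises an exception:
            self.assertRaises(Exception,
                              wash_table_column_name, "id; DROP TABLE user")

    if __name__ == "__main__":
        unittest.main()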

Python promotes writing clear, readable, easily maintainable code. Write it as such. Recall Albert Einstein's ``Everything should be made as simple as possible, but not simpler''. Things should be neither overengineered nor oversimplified.

Recall the principles Unix is built upon, as summarized by Eric S. Raymond's TAOUP:

  • Rule of Modularity: Write simple parts connected by clean interfaces.
  • Rule of Clarity: Clarity is better than cleverness.
  • Rule of Composition: Design programs to be connected with other programs.
  • Rule of Separation: Separate policy from mechanism; separate interfaces from engines.
  • Rule of Simplicity: Design for simplicity; add complexity only where you must.
  • Rule of Parsimony: Write a big program only when it is clear by demonstration that nothing else will do.
  • Rule of Transparency: Design for visibility to make inspection and debugging easier.
  • Rule of Robustness: Robustness is the child of transparency and simplicity.
  • Rule of Representation: Fold knowledge into data, so program logic can be stupid and robust.
  • Rule of Least Surprise: In interface design, always do the least surprising thing.
  • Rule of Silence: When a program has nothing surprising to say, it should say nothing.
  • Rule of Repair: Repair what you can -- but when you must fail, fail noisily and as soon as possible.
  • Rule of Economy: Programmer time is expensive; conserve it in preference to machine time.
  • Rule of Generation: Avoid hand-hacking; write programs to write programs when you can.
  • Rule of Optimization: Prototype before polishing. Get it working before you optimize it.
  • Rule of Diversity: Distrust all claims for one true way.
  • Rule of Extensibility: Design for the future, because it will be here sooner than you think.
or the golden rule that says it all: ``keep it simple''.

Think of security and robustness from the start. Follow secure programming guidelines.

For more hints, thoughts, and other ruminations on programming, see our CDS Invenio wiki, notably Git Workflow and Invenio QA.

4. MySQL

The table naming policy is, roughly and briefly:

  • "foo": table names in lowercase, without prefix, used by me for WebSearch
  • "foo_bar": underscores represent M:N relationship between "foo" and "bar", to tie the two tables together
  • "bib*": many tables to hold the metadata and relationships between them
  • "idx*": idx is the table name prefix used by BibIndex
  • "rnk*": rnk is the table name prefix used by BibRank
  • "fmt*": fmt is the table name prefix used by BibFormat
  • "sbm*": sbm is the table name prefix used by WebSubmit
  • "sch*": sch is the table name prefix used by BibSched
  • "acc*": acc is the table name prefix used by WebAccess
  • "bsk*": acc is the table name prefix used by WebBasket
  • "msg*": acc is the table name prefix used by WebMessage
  • "cls*": acc is the table name prefix used by BibClassify
  • "sta*": acc is the table name prefix used by WebStat
  • "jrn*": acc is the table name prefix used by WebJournal
  • "collection*": many tables to describe collections and search interface pages
  • "user*" : many tables to describe personal features (baskets, alerts)
  • "hst*": tables related to historical versions of metadata and fulltext files