diff --git a/INSTALL b/INSTALL index b1dc674e2..0ed827133 100644 --- a/INSTALL +++ b/INSTALL @@ -1,824 +1,825 @@ Invenio INSTALLATION ==================== About ===== This document specifies how to build, customize, and install Invenio v1.1.1 for the first time. See RELEASE-NOTES if you are upgrading from a previous Invenio release. Contents ======== 0. Prerequisites 1. Quick instructions for the impatient Invenio admin 2. Detailed instructions for the patient Invenio admin 0. Prerequisites ================ Here is the software you need to have around before you start installing Invenio: a) Unix-like operating system. The main development and production platforms for Invenio at CERN are GNU/Linux distributions Debian, Gentoo, Scientific Linux (aka RHEL), Ubuntu, but we also develop on Mac OS X. Basically any Unix system supporting the software listed below should do. If you are using Debian GNU/Linux ``Lenny'' or later, then you can install most of the below-mentioned prerequisites and recommendations by running: $ sudo aptitude install python-dev apache2-mpm-prefork \ mysql-server mysql-client python-mysqldb \ python-4suite-xml python-simplejson python-xml \ python-libxml2 python-libxslt1 gnuplot poppler-utils \ gs-common clisp gettext libapache2-mod-wsgi unzip \ python-dateutil python-rdflib \ python-gnuplot python-magic pdftk html2text giflib-tools \ pstotext netpbm python-pypdf python-chardet python-lxml You may also want to install some of the following packages, if you have them available on your concrete architecture: $ sudo aptitude install sbcl cmucl pylint pychecker pyflakes \ python-profiler python-epydoc libapache2-mod-xsendfile \ openoffice.org python-utidylib python-beautifulsoup Moreover, you should install some Message Transfer Agent (MTA) such as Postfix so that Invenio can email notification alerts or registration information to the end users, contact moderators and reviewers of submitted documents, inform administrators about various runtime system information, etc: $ sudo aptitude install postfix After running the above-quoted aptitude command(s), you can proceed to configuring your MySQL server instance (max_allowed_packet in my.cnf, see item 0b below) and then to installing the Invenio software package in the section 1 below. If you are using another operating system, then please continue reading the rest of this prerequisites section, and please consult our wiki pages for any concrete hints for your specific operating system. b) MySQL server (may be on a remote machine), and MySQL client (must be available locally too). MySQL versions 4.1 or 5.0 are supported. Please set the variable "max_allowed_packet" in your "my.cnf" init file to at least 4M. (For sites such as INSPIRE, having 1M records with 10M citer-citee pairs in its citation map, you may need to increase max_allowed_packet to 1G.) You may perhaps also want to run your MySQL server natively in UTF-8 mode by setting "default-character-set=utf8" in various parts of your "my.cnf" file, such as in the "[mysql]" part and elsewhere; but this is not really required. c) Apache 2 server, with support for loading DSO modules, and optionally with SSL support for HTTPS-secure user authentication, and mod_xsendfile for off-loading file downloads away from Invenio processes to Apache. 
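For item 0b above, the MySQL settings typically end up in a my.cnf fragment along these lines (a sketch only; the file location, section layout and suitable values differ between distributions and sites, so adapt as needed):

    [mysqld]
    max_allowed_packet = 4M
    default-character-set = utf8

    [mysql]
    default-character-set = utf8

You can verify the value picked up by a running server with:

    $ mysql -u root -p -e "SHOW VARIABLES LIKE 'max_allowed_packet'"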
d) Python v2.4 or above, as well as the following Python modules: - (mandatory) MySQLdb (version >= 1.2.1_p2; see below) - (recommended) python-dateutil, for complex date processing: - (recommended) PyXML, for XML processing: - (recommended) PyRXP, for very fast XML MARC processing: - (recommended) lxml, for XML/XSLT processing: - (recommended) libxml2-python, for XML/XSLT processing: - (recommended) simplejson, for AJAX apps: Note that if you are using Python-2.6, you don't need to install simplejson, because the module is already included in the main Python distribution. - (recommended) Gnuplot.Py, for producing graphs: - (recommended) Snowball Stemmer, for stemming: - (recommended) py-editdist, for record merging: - (recommended) numpy, for citerank methods: - (recommended) magic, for full-text file handling: - (optional) chardet, for character encoding detection: - (optional) 4suite, slower alternative to PyRXP and libxml2-python: - (optional) feedparser, for web journal creation: - (optional) RDFLib, to use RDF ontologies and thesauri: - (optional) mechanize, to run regression web test suite: - (optional) python-mock, mocking library for the test suite: - (optional) hashlib, needed only for Python-2.4 and only if you would like to use AWS connectivity: - (optional) utidylib, for HTML washing: - (optional) Beautiful Soup, for HTML washing: - (optional) Python Twitter (and its dependencies) if you want to use the Twitter Fetcher bibtasklet: Note: MySQLdb version 1.2.1_p2 or higher is recommended. If you are using an older version of MySQLdb, you may get into problems with character encoding. e) mod_wsgi Apache module. Versions 3.x and above are recommended. Note: if you are using Python 2.4 or earlier, then you should also install the wsgiref Python module, available from: (As of Python 2.5 this module is included in the standard Python distribution.) f) If you want to be able to extract references from PDF fulltext files, then you need to install pdftotext version 3 or later. g) If you want to be able to search for words in the fulltext files (i.e. to have fulltext indexing) or to stamp submitted files, then you also need to install some of the following tools: - for Microsoft Office/OpenOffice.org document conversion: OpenOffice.org - for PDF file stamping: pdftk, pdf2ps - for PDF files: pdftotext or pstotext - for PostScript files: pstotext or ps2ascii - for DjVu creation and manipulation: DjVuLibre - to perform OCR: OCRopus (tested only with release 0.3.1) - to perform various image manipulations: ImageMagick - - to generate PDF after OCR: netpbm, ReportLab and pyPdf + - to generate PDF after OCR: netpbm, ReportLab and pyPdf or pyPdf2 + h) If you have chosen to install fast XML MARC Python processors in step d) above, then you have to install the parsers themselves: - (optional) 4suite: i) (recommended) Gnuplot, the command-line driven interactive plotting program. It is used to display download and citation history graphs on the Detailed record pages on the web interface. Note that Gnuplot must be compiled with PNG output support, that is, with the GD library. Note also that Gnuplot is not required, only recommended. j) (recommended) A Common Lisp implementation, such as CLISP, SBCL or CMUCL. It is used for the web server log analysing tool and the metadata checking program. Note that any of the three implementations CLISP, SBCL, or CMUCL will do. CMUCL produces the fastest machine code, but it does not support UTF-8 yet. Pick CLISP if you are unsure which to choose. Note that a Common Lisp implementation is not required, only recommended.
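Before moving on, a quick way to check which MySQLdb version (item d above) the Python interpreter you intend to use will pick up is a one-liner such as the following (a simple sketch, not an Invenio tool):

    $ python -c "import MySQLdb; print MySQLdb.__version__"

If this prints a version older than 1.2.1_p2, or fails with an ImportError, install or upgrade the module before continuing.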
k) GNU gettext, a set of tools that makes it possible to translate the application into multiple languages. This is available by default on many systems. l) (recommended) xlwt 0.7.2, a library to create spreadsheet files compatible with MS Excel 97/2000/XP/2003 XLS files, on any platform, with Python 2.3 to 2.6. m) (recommended) matplotlib 1.0.0, a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in Python scripts, the Python and IPython shells (à la MATLAB® or Mathematica®), web application servers, and six graphical user interface toolkits. It is used to generate pie graphs in the custom summary query (WebStat). n) (optional) FFmpeg, an open-source collection of tools and libraries to convert video and audio files. It makes use of both internal and external libraries to generate videos for the web, such as Theora, WebM and H.264, out of almost any conceivable video input. FFmpeg is needed to run video-related modules and submission workflows in Invenio. The minimal configuration of ffmpeg for the Invenio demo site requires a number of external libraries. It is highly recommended to remove all installed versions and packages that come with various Linux distributions and to install the latest versions from source. Additionally, you will need the Mediainfo Library for multimedia metadata handling. Minimum libraries for the demo site: - the ffmpeg multimedia encoder tools - a library for JPEG images, needed for thumbnail extraction - a library for the Ogg container format, needed for Vorbis and Theora - the Ogg Vorbis audio codec library - the Ogg Theora video codec library - the WebM video codec library - the mediainfo library for multimedia metadata Recommended for H.264 video (be aware of licensing issues!): - a library for H.264 video encoding - a library for Advanced Audio Coding - a library for MP3 encoding Note that the configure script checks whether you have all the prerequisite software installed and it won't let you continue unless everything is in order. It also warns you if it cannot find some optional but recommended software. 1. Quick instructions for the impatient Invenio admin ========================================================= 1a. Installation ---------------- $ cd $HOME/src/ $ wget http://invenio-software.org/download/invenio-1.1.1.tar.gz $ wget http://invenio-software.org/download/invenio-1.1.1.tar.gz.md5 $ wget http://invenio-software.org/download/invenio-1.1.1.tar.gz.sig $ md5sum -c invenio-1.1.1.tar.gz.md5 $ gpg --verify invenio-1.1.1.tar.gz.sig invenio-1.1.1.tar.gz $ tar xvfz invenio-1.1.1.tar.gz $ cd invenio-1.1.1 $ ./configure $ make $ make install $ make install-mathjax-plugin ## optional $ make install-jquery-plugins ## optional $ make install-ckeditor-plugin ## optional $ make install-pdfa-helper-files ## optional $ make install-mediaelement ## optional $ make install-solrutils ## optional $ make install-js-test-driver ## optional 1b.
Configuration ----------------- $ sudo chown -R www-data.www-data /opt/invenio $ sudo -u www-data emacs /opt/invenio/etc/invenio-local.conf $ sudo -u www-data /opt/invenio/bin/inveniocfg --update-all $ sudo -u www-data /opt/invenio/bin/inveniocfg --create-tables $ sudo -u www-data /opt/invenio/bin/inveniocfg --load-webstat-conf $ sudo -u www-data /opt/invenio/bin/inveniocfg --create-apache-conf $ sudo /etc/init.d/apache2 restart $ sudo -u www-data /opt/invenio/bin/inveniocfg --check-openoffice $ sudo -u www-data /opt/invenio/bin/inveniocfg --create-demo-site $ sudo -u www-data /opt/invenio/bin/inveniocfg --load-demo-records $ sudo -u www-data /opt/invenio/bin/inveniocfg --run-unit-tests $ sudo -u www-data /opt/invenio/bin/inveniocfg --run-regression-tests $ sudo -u www-data /opt/invenio/bin/inveniocfg --run-web-tests $ sudo -u www-data /opt/invenio/bin/inveniocfg --remove-demo-records $ sudo -u www-data /opt/invenio/bin/inveniocfg --drop-demo-site $ firefox http://your.site.com/help/admin/howto-run 2. Detailed instructions for the patient Invenio admin ========================================================== 2a. Installation ---------------- Invenio uses the standard GNU autoconf method to build and install its files. This means that you proceed as follows: $ cd $HOME/src/ Change to a directory where we will build the Invenio sources. (The built files will be installed into different "target" directories later.) $ wget http://invenio-software.org/download/invenio-1.1.1.tar.gz $ wget http://invenio-software.org/download/invenio-1.1.1.tar.gz.md5 $ wget http://invenio-software.org/download/invenio-1.1.1.tar.gz.sig Fetch the Invenio source tarball from the distribution server, together with the MD5 checksum and GnuPG cryptographic signature files useful for verifying the integrity of the tarball. $ md5sum -c invenio-1.1.1.tar.gz.md5 Verify the MD5 checksum. $ gpg --verify invenio-1.1.1.tar.gz.sig invenio-1.1.1.tar.gz Verify the GnuPG cryptographic signature. Note that you may first have to import my public key into your keyring, if you haven't done that already: $ gpg --keyserver wwwkeys.eu.pgp.net --recv-keys 0xBA5A2B67 The output of the gpg --verify command should then read: Good signature from "Tibor Simko " You can safely ignore any trusted signature certification warning that may follow after the signature has been successfully verified. $ tar xvfz invenio-1.1.1.tar.gz Untar the distribution tarball. $ cd invenio-1.1.1 Go to the source directory. $ ./configure Configure Invenio software for building on this specific platform. You can use the following optional parameters: --prefix=/opt/invenio Optionally, specify the Invenio general installation directory (default is /opt/invenio). It will contain command-line binaries and program libraries containing the core Invenio functionality, but also store web pages, runtime log and cache information, document data files, etc. Several subdirs like `bin', `etc', `lib', or `var' will be created inside the prefix directory to this effect. Note that the prefix directory should be chosen outside of the Apache htdocs tree, since only one of its subdirectories (prefix/var/www) is meant to be accessible directly via the Web (see below). Note that Invenio won't install to any other directory but the prefix mentioned in this configuration line. --with-python=/opt/python/bin/python2.4 Optionally, specify a path to some specific Python binary. This is useful if you have more than one Python installation on your system.
If you don't set this option, then the first Python that will be found in your PATH will be chosen for running Invenio. --with-mysql=/opt/mysql/bin/mysql Optionally, specify a path to some specific MySQL client binary. This is useful if you have more than one MySQL installation on your system. If you don't set this option, then the first MySQL client executable that will be found in your PATH will be chosen for running Invenio. --with-clisp=/opt/clisp/bin/clisp Optionally, specify a path to the CLISP executable. This is useful if you have more than one CLISP installation on your system. If you don't set this option, then the first executable that will be found in your PATH will be chosen for running Invenio. --with-cmucl=/opt/cmucl/bin/lisp Optionally, specify a path to the CMUCL executable. This is useful if you have more than one CMUCL installation on your system. If you don't set this option, then the first executable that will be found in your PATH will be chosen for running Invenio. --with-sbcl=/opt/sbcl/bin/sbcl Optionally, specify a path to the SBCL executable. This is useful if you have more than one SBCL installation on your system. If you don't set this option, then the first executable that will be found in your PATH will be chosen for running Invenio. --with-openoffice-python Optionally, specify the path to the Python interpreter embedded with OpenOffice.org. This is normally not on the default PATH. If you don't specify this, it won't be possible to use OpenOffice.org to convert to and from Microsoft Office and OpenOffice.org documents. This configuration step is mandatory. Usually, you do this step only once. (Note that if you are building Invenio not from a released tarball, but from the Git sources, then you have to generate the configure file via autotools: $ sudo aptitude install automake1.9 autoconf $ aclocal-1.9 $ automake-1.9 -a $ autoconf after which you proceed with the usual configure command.) $ make Launch the Invenio build. Since many messages are printed during the build process, you may want to run it in a fast-scrolling terminal such as rxvt or in a detached screen session. During this step all the pages and scripts will be pre-created and customized based on the config you have edited in the previous step. Note that on systems such as FreeBSD or Mac OS X you have to use GNU make ("gmake") instead of "make". $ make install Install the web pages, scripts, utilities and everything needed for the Invenio runtime into the respective installation directories, as specified earlier by the configure command. Note that if you are installing Invenio for the first time, you will be asked to create symbolic link(s) from Python's site-packages system-wide directory(ies) to the installation location. This is in order to instruct Python where to find Invenio's Python files. You will be hinted as to the exact command to use based on the parameters you have used in the configure command. $ make install-mathjax-plugin ## optional This will automatically download and install in the proper place MathJax, a JavaScript library to render LaTeX formulas in the client browser. Note that in order to enable the rendering you will have to set the variable CFG_WEBSEARCH_USE_MATHJAX_FOR_FORMATS in invenio-local.conf to a suitable list of output format codes. For example: CFG_WEBSEARCH_USE_MATHJAX_FOR_FORMATS = hd,hb $ make install-jquery-plugins ## optional This will automatically download and install in the proper place jQuery and related plugins.
They are used for AJAX applications such as the record editor. Note that `unzip' is needed when installing the jQuery plugins. $ make install-ckeditor-plugin ## optional This will automatically download and install in the proper place CKeditor, a WYSIWYG JavaScript-based editor (e.g. for the WebComment module). Note that in order to enable the editor you have to set CFG_WEBCOMMENT_USE_RICH_TEXT_EDITOR to True. $ make install-pdfa-helper-files ## optional This will automatically download and install in the proper place the helper files needed to create PDF/A files out of existing PDF files. $ make install-mediaelement ## optional This will automatically download and install the MediaElementJS HTML5 video player that is needed for videos on the DEMO site. $ make install-solrutils ## optional This will automatically download and install a Solr instance which can be used for full-text searching. See the CFG_SOLR_URL variable in invenio.conf. Note that the admin later has to take care of providing init.d scripts that start the Solr instance automatically. $ make install-js-test-driver ## optional This will automatically download and install JsTestDriver, which is needed to run JS unit tests. Recommended for developers. 2b. Configuration ----------------- Once the basic software installation is done, we proceed to configuring your Invenio system. $ sudo chown -R www-data.www-data /opt/invenio For the sake of simplicity, let us assume that your Invenio installation will run under the `www-data' user process identity. The above command changes ownership of installed files to www-data, so that we shall run everything under this user identity from now on. For production purposes, you would typically allow the Apache server to read all files from the installation place but to write only to the `var' subdirectory of your installation place. You could achieve this by configuring Unix directory group permissions, for example. $ sudo -u www-data emacs /opt/invenio/etc/invenio-local.conf Customize your Invenio installation. Please read the 'invenio.conf' file, located in the same directory, which contains the vanilla default configuration parameters of your Invenio installation. If you want to customize some of these parameters, you should create a file named 'invenio-local.conf' in the same directory where 'invenio.conf' lives and you should write there only the customizations that you want to be different from the vanilla defaults.
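To see the full list of configuration variables that you may override, together with their default values and inline documentation, you can simply browse the vanilla file, for example (plain grep and less, nothing Invenio-specific):

    $ less /opt/invenio/etc/invenio.conf
    $ grep '^CFG_' /opt/invenio/etc/invenio.conf | less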
Here is a realistic, minimalist, yet production-ready example of what you would typically put there: $ cat /opt/invenio/etc/invenio-local.conf [Invenio] CFG_SITE_NAME = John Doe's Document Server CFG_SITE_NAME_INTL_fr = Serveur des Documents de John Doe CFG_SITE_URL = http://your.site.com CFG_SITE_SECURE_URL = https://your.site.com CFG_SITE_ADMIN_EMAIL = john.doe@your.site.com CFG_SITE_SUPPORT_EMAIL = john.doe@your.site.com CFG_WEBALERT_ALERT_ENGINE_EMAIL = john.doe@your.site.com CFG_WEBCOMMENT_ALERT_ENGINE_EMAIL = john.doe@your.site.com CFG_WEBCOMMENT_DEFAULT_MODERATOR = john.doe@your.site.com CFG_DATABASE_HOST = localhost CFG_DATABASE_NAME = invenio CFG_DATABASE_USER = invenio CFG_DATABASE_PASS = my123p$ss CFG_BIBDOCFILE_ENABLE_BIBDOCFSINFO_CACHE = 1 You should override at least the parameters mentioned above in order to define some very essential runtime parameters such as the name of your document server (CFG_SITE_NAME and CFG_SITE_NAME_INTL_*), the visible URL of your document server (CFG_SITE_URL and CFG_SITE_SECURE_URL), the email address of the local Invenio administrator, comment moderator, and alert engine (CFG_SITE_SUPPORT_EMAIL, CFG_SITE_ADMIN_EMAIL, etc), and last but not least your database credentials (CFG_DATABASE_*). If this is a first installation of Invenio, it is recommended that you set the CFG_BIBDOCFILE_ENABLE_BIBDOCFSINFO_CACHE variable to 1. If this is instead an upgrade from an existing installation, don't add it until you have run: $ bibdocfile --fix-bibdocfsinfo-cache . The Invenio system will then read both the default invenio.conf file and your customized invenio-local.conf file and it will override any default options with the ones you have specified in your local file. This cascading of configuration parameters will ease your future upgrades. If you want to have multiple Invenio instances for distributed video encoding, you need to share the same configuration among them and make some of the folders of the Invenio installation available to all nodes. Configure the allowed tasks for every node: CFG_BIBSCHED_NODE_TASKS = { "hostname_machine1" : ["bibindex", "bibupload", "bibreformat","webcoll", "bibtaskex", "bibrank", "oaiharvest", "oairepositoryupdater", "inveniogc", "webstatadmin", "bibclassify", "bibexport", "dbdump", "batchuploader", "bibauthorid", "bibtasklet"], "hostname_machine2" : ['bibencode',] } Share the following directories among Invenio instances: /var/tmp-shared hosts video uploads in a temporary form /var/tmp-shared/bibencode/jobs hosts new job files for the video encoding daemon /var/tmp-shared/bibencode/jobs/done hosts job files that have been processed by the daemon /var/data/files hosts fulltext and media files associated with records /var/data/submit hosts files created during submissions $ sudo -u www-data /opt/invenio/bin/inveniocfg --update-all Make the rest of the Invenio system aware of your invenio-local.conf changes. This step is mandatory each time you edit your conf files. $ sudo -u www-data /opt/invenio/bin/inveniocfg --create-tables If you are installing Invenio for the first time, you have to create database tables. Note that this step checks for potential problems such as the database connection rights and may ask you to perform some more administrative steps in case it detects a problem. Notably, it may ask you to set up database access permissions, based on your configure values. If you are installing Invenio for the first time, you have to create a dedicated database on your MySQL server that Invenio can use for its purposes.
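In practice, the commands proposed by this step boil down to creating the database and granting access to it, along these lines (a sketch only; the exact statements printed by inveniocfg depend on your CFG_DATABASE_* values, here taken from the example above):

    $ mysql -h localhost -u root -p
    mysql> CREATE DATABASE invenio DEFAULT CHARACTER SET utf8;
    mysql> GRANT ALL PRIVILEGES ON invenio.* TO 'invenio'@'localhost' IDENTIFIED BY 'my123p$ss';
    mysql> FLUSH PRIVILEGES;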
Please contact your MySQL administrator and ask them to execute the commands this step proposes. At this point you should have successfully completed the "make install" process. We continue by setting up the Apache web server. $ sudo -u www-data /opt/invenio/bin/inveniocfg --load-webstat-conf Load the configuration file of the webstat module. It will create the database tables needed for registering custom events, such as basket hits. $ sudo -u www-data /opt/invenio/bin/inveniocfg --create-apache-conf Running this command will generate Apache virtual host configurations matching your installation. You will be instructed to check the created files (usually they are located under /opt/invenio/etc/apache/) and to edit your httpd.conf to activate the Invenio virtual hosts. If you are using Debian GNU/Linux ``Lenny'' or later, then you can do the following to create your SSL certificate and to activate your Invenio vhosts: ## make SSL certificate: $ sudo aptitude install ssl-cert $ sudo mkdir /etc/apache2/ssl $ sudo /usr/sbin/make-ssl-cert /usr/share/ssl-cert/ssleay.cnf \ /etc/apache2/ssl/apache.pem ## add Invenio web sites: $ sudo ln -s /opt/invenio/etc/apache/invenio-apache-vhost.conf \ /etc/apache2/sites-available/invenio $ sudo ln -s /opt/invenio/etc/apache/invenio-apache-vhost-ssl.conf \ /etc/apache2/sites-available/invenio-ssl ## disable Debian's default web site: $ sudo /usr/sbin/a2dissite default ## enable Invenio web sites: $ sudo /usr/sbin/a2ensite invenio $ sudo /usr/sbin/a2ensite invenio-ssl ## enable SSL module: $ sudo /usr/sbin/a2enmod ssl ## if you are using xsendfile module, enable it too: $ sudo /usr/sbin/a2enmod xsendfile If you are using another operating system, you should do the equivalent, for example edit your system-wide httpd.conf and add the following include statements: Include /opt/invenio/etc/apache/invenio-apache-vhost.conf Include /opt/invenio/etc/apache/invenio-apache-vhost-ssl.conf Note that you may need to adapt the generated vhost file snippets to match your concrete operating system specifics. For example, the generated configuration snippet will preload the Invenio WSGI daemon application upon Apache start-up for faster site response. The generated configuration assumes that you are using mod_wsgi version 3 or later. If you are using the old legacy mod_wsgi version 2, then you would need to comment out the WSGIImportScript directive from the generated snippet, or else move the WSGI daemon setup to the top level, outside of the VirtualHost section. Note also that you may want to tweak the generated Apache vhost snippet for performance reasons, especially with respect to WSGIDaemonProcess parameters. For example, you can increase the number of processes from the default value `processes=5' if you have lots of RAM and if many concurrent users may access your site in parallel. However, note that you must use `threads=1' there, because Invenio WSGI daemon processes are not fully thread safe yet. This may change in the future. $ sudo /etc/init.d/apache2 restart Please ask your webserver administrator to restart the Apache server after the above "httpd.conf" changes. $ sudo -u www-data /opt/invenio/bin/inveniocfg --check-openoffice If you plan to support MS Office or Open Document Format files in your installation, you should check whether LibreOffice or OpenOffice.org is well integrated with Invenio by running the above command. You may be asked to create a temporary directory for converting office files with special ownership (typically as user nobody) and permissions.
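If the check reports a problem, the remedy it suggests usually amounts to creating that directory by hand, along the lines of the following sketch (the exact path, ownership and permissions to use are printed by the check itself and may differ on your system):

    $ sudo mkdir -p /opt/invenio/var/tmp/ooffice-tmp-files
    $ sudo chown -R nobody /opt/invenio/var/tmp/ooffice-tmp-files
    $ sudo chmod -R 755 /opt/invenio/var/tmp/ooffice-tmp-files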
Note that you can do this step later. $ sudo -u www-data /opt/invenio/bin/inveniocfg --create-demo-site This step is recommended to test your local Invenio installation. It should give you our "Atlantis Institute of Science" demo installation, exactly as you see it at . $ sudo -u www-data /opt/invenio/bin/inveniocfg --load-demo-records Optionally, load some demo records to be able to test indexing and searching of your local Invenio demo installation. $ sudo -u www-data /opt/invenio/bin/inveniocfg --run-unit-tests Optionally, you can run the unit test suite to verify the unit behaviour of your local Invenio installation. Note that this command should be run only after you have installed the whole system via `make install'. $ sudo -u www-data /opt/invenio/bin/inveniocfg --run-regression-tests Optionally, you can run the full regression test suite to verify the functional behaviour of your local Invenio installation. Note that this command requires the demo site to have been created and the demo records loaded. Note also that running the regression test suite may alter the database content with junk data, so rebuilding the demo site is strongly recommended afterwards. $ sudo -u www-data /opt/invenio/bin/inveniocfg --run-web-tests Optionally, you can run additional automated web tests running in a real browser. This requires Firefox with the Selenium IDE extension to be installed. $ sudo -u www-data /opt/invenio/bin/inveniocfg --remove-demo-records Optionally, remove the demo records loaded in the previous step, while otherwise keeping the demo collection, submission, format, and other configurations that you may reuse and modify for your own production purposes. $ sudo -u www-data /opt/invenio/bin/inveniocfg --drop-demo-site Optionally, also drop all the demo configuration so that you'll end up with a completely blank Invenio system. However, you may find it more practical not to drop the demo site configuration but to start customizing from there. $ firefox http://your.site.com/help/admin/howto-run In order to start using your Invenio installation, you can start indexing, formatting and other daemons as indicated in the "HOWTO Run" guide at the above URL. You can also use the Admin Area web interfaces to perform further runtime configurations such as the definition of data collections, document types, document formats, word indexes, etc. $ sudo ln -s /opt/invenio/etc/bash_completion.d/inveniocfg \ /etc/bash_completion.d/inveniocfg Optionally, if you are using Bash shell completion, then you may want to create the above symlink in order to configure completion for the inveniocfg command. Good luck, and thanks for choosing Invenio. - Invenio Development Team diff --git a/Makefile.am b/Makefile.am index 672600c48..9ef70fd1b 100644 --- a/Makefile.am +++ b/Makefile.am @@ -1,455 +1,455 @@ ## This file is part of Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013 CERN. ## ## Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details.
## ## You should have received a copy of the GNU General Public License ## along with Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. confignicedir = $(sysconfdir)/build confignice_SCRIPTS=config.nice SUBDIRS = po config modules EXTRA_DIST = UNINSTALL THANKS RELEASE-NOTES configure-tests.py config.nice.in \ config.rpath # current MathJax version and packages # See also modules/miscutil/lib/htmlutils.py (get_mathjax_header) MJV = 2.1 MATHJAX = http://invenio-software.org/download/mathjax/MathJax-v$(MJV).zip # current CKeditor version CKV = 3.6.6 CKEDITOR = ckeditor_$(CKV).zip # current MediaElement.js version MEV = master MEDIAELEMENT = http://github.com/johndyer/mediaelement/zipball/$(MEV) #for solrutils INVENIO_JAVA_PATH = org/invenio_software/solr solrdirname = apache-solr-3.1.0 solrdir = $(prefix)/lib/$(solrdirname) solrutils_dir=$(CURDIR)/modules/miscutil/lib/solrutils CLASSPATH=.:${solrdir}/dist/solrj-lib/commons-io-1.4.jar:${solrdir}/dist/apache-solr-core-*jar:${solrdir}/contrib/jzlib-1.0.7.jar:${solrdir}/dist/apache-solr-solrj-3.1.0.jar:${solrdir}/dist/solrj-lib/slf4j-api-1.5.5.jar:${solrdir}/dist/*:${solrdir}/contrib/basic-lucene-libs/*:${solrdir}/contrib/analysis-extras/lucene-libs/*:${solrdir}/dist/solrj-lib/* # git-version-get stuff: BUILT_SOURCES = $(top_srcdir)/.version $(top_srcdir)/.version: echo $(VERSION) > $@-t && mv $@-t $@ dist-hook: echo $(VERSION) > $(distdir)/.tarball-version check-upgrade: $(PYTHON) $(top_srcdir)/modules/miscutil/lib/inveniocfg_upgrader.py $(top_srcdir) --upgrade-check kwalitee-check: @$(PYTHON) $(top_srcdir)/modules/miscutil/lib/kwalitee.py --stats $(top_srcdir) kwalitee-check-errors-only: @$(PYTHON) $(top_srcdir)/modules/miscutil/lib/kwalitee.py --check-errors $(top_srcdir) kwalitee-check-variables: @$(PYTHON) $(top_srcdir)/modules/miscutil/lib/kwalitee.py --check-variables $(top_srcdir) kwalitee-check-indentation: @$(PYTHON) $(top_srcdir)/modules/miscutil/lib/kwalitee.py --check-indentation $(top_srcdir) kwalitee-check-sql-queries: @$(PYTHON) $(top_srcdir)/modules/miscutil/lib/kwalitee.py --check-sql $(top_srcdir) etags: \rm -f $(top_srcdir)/TAGS (cd $(top_srcdir) && find $(top_srcdir) -name "*.py" -print | xargs etags) install-data-local: for d in / /cache /cache/RTdata /log /tmp /tmp-shared /data /run /tmp-shared/bibencode/jobs/done /tmp-shared/bibedit-cache; do \ mkdir -p $(localstatedir)$$d ; \ done @echo "************************************************************" @echo "** Invenio software has been successfully installed! **" @echo "** **" @echo "** You may proceed to customizing your installation now. **" @echo "************************************************************" install-mathjax-plugin: @echo "***********************************************************" @echo "** Installing MathJax plugin, please wait... **" @echo "***********************************************************" rm -rf /tmp/invenio-mathjax-plugin mkdir /tmp/invenio-mathjax-plugin rm -fr ${prefix}/var/www/MathJax mkdir -p ${prefix}/var/www/MathJax (cd /tmp/invenio-mathjax-plugin && \ wget '$(MATHJAX)' -O mathjax.zip && \ - unzip -q mathjax.zip && cd mathjax-MathJax-* && cp -ur * \ + unzip -q mathjax.zip && cd mathjax-MathJax-* && cp -r * \ ${prefix}/var/www/MathJax) rm -fr /tmp/invenio-mathjax-plugin @echo "************************************************************" @echo "** The MathJax plugin was successfully installed. 
**" @echo "** Please do not forget to properly set the option **" @echo "** CFG_WEBSEARCH_USE_MATHJAX_FOR_FORMATS and **" @echo "** CFG_WEBSUBMIT_USE_MATHJAX in invenio.conf. **" @echo "************************************************************" uninstall-mathjax-plugin: @rm -rvf ${prefix}/var/www/MathJax @echo "***********************************************************" @echo "** The MathJax plugin was successfully uninstalled. **" @echo "***********************************************************" install-jscalendar-plugin: @echo "***********************************************************" @echo "** Installing jsCalendar plugin, please wait... **" @echo "***********************************************************" rm -rf /tmp/invenio-jscalendar-plugin mkdir /tmp/invenio-jscalendar-plugin (cd /tmp/invenio-jscalendar-plugin && \ wget 'http://www.dynarch.com/static/jscalendar-1.0.zip' && \ unzip -u jscalendar-1.0.zip && \ mkdir -p ${prefix}/var/www/jsCalendar && \ cp jscalendar-1.0/img.gif ${prefix}/var/www/jsCalendar/jsCalendar.gif && \ cp jscalendar-1.0/calendar.js ${prefix}/var/www/jsCalendar/ && \ cp jscalendar-1.0/calendar-setup.js ${prefix}/var/www/jsCalendar/ && \ cp jscalendar-1.0/lang/calendar-en.js ${prefix}/var/www/jsCalendar/ && \ cp jscalendar-1.0/calendar-blue.css ${prefix}/var/www/jsCalendar/) rm -fr /tmp/invenio-jscalendar-plugin @echo "***********************************************************" @echo "** The jsCalendar plugin was successfully installed. **" @echo "***********************************************************" uninstall-jscalendar-plugin: @rm -rvf ${prefix}/var/www/jsCalendar @echo "***********************************************************" @echo "** The jsCalendar plugin was successfully uninstalled. **" @echo "***********************************************************" install-js-test-driver: @echo "*******************************************************" @echo "** Installing js-test-driver, please wait... **" @echo "*******************************************************" mkdir -p $(prefix)/lib/java/js-test-driver && \ cd $(prefix)/lib/java/js-test-driver && \ wget http://invenio-software.org/download/js-test-driver/JsTestDriver-1.3.5.jar -O JsTestDriver.jar uninstall-js-test-driver: @rm -rvf ${prefix}/lib/java/js-test-driver @echo "*********************************************************" @echo "** The js-test-driver was successfully uninstalled. **" @echo "*********************************************************" install-jquery-plugins: @echo "***********************************************************" @echo "** Installing various jQuery plugins, please wait... 
**" @echo "***********************************************************" mkdir -p ${prefix}/var/www/js mkdir -p $(prefix)/var/www/css (cd ${prefix}/var/www/js && \ wget http://code.jquery.com/jquery-1.7.1.min.js && \ mv jquery-1.7.1.min.js jquery.min.js && \ wget http://ajax.googleapis.com/ajax/libs/jqueryui/1.8.17/jquery-ui.min.js && \ wget http://invenio-software.org/download/jquery/v1.5/js/jquery.jeditable.mini.js && \ wget https://raw.github.com/malsup/form/master/jquery.form.js --no-check-certificate && \ wget http://jquery-multifile-plugin.googlecode.com/svn/trunk/jquery.MultiFile.pack.js && \ wget -O jquery.tablesorter.zip http://invenio-software.org/download/jquery/jquery.tablesorter.20111208.zip && \ wget http://invenio-software.org/download/jquery/uploadify-v2.1.4.zip -O uploadify.zip && \ wget http://www.datatables.net/download/build/jquery.dataTables.min.js && \ wget http://invenio-software.org/download/jquery/jquery.bookmark.package-1.4.0.zip && \ unzip jquery.tablesorter.zip -d tablesorter && \ rm jquery.tablesorter.zip && \ rm -rf uploadify && \ unzip -u uploadify.zip -d uploadify && \ wget http://flot.googlecode.com/files/flot-0.6.zip && \ wget -O jquery-ui-timepicker-addon.js http://invenio-software.org/download/jquery/jquery-ui-timepicker-addon-1.0.3.js && \ unzip -u flot-0.6.zip && \ mv flot/jquery.flot.selection.min.js flot/jquery.flot.min.js flot/excanvas.min.js ./ && \ rm flot-0.6.zip && rm -r flot && \ mv uploadify/swfobject.js ./ && \ mv uploadify/cancel.png uploadify/uploadify.css uploadify/uploadify.allglyphs.swf uploadify/uploadify.fla uploadify/uploadify.swf ../img/ && \ mv uploadify/jquery.uploadify.v2.1.4.min.js ./jquery.uploadify.min.js && \ rm uploadify.zip && rm -r uploadify && \ wget --no-check-certificate https://github.com/douglascrockford/JSON-js/raw/master/json2.js && \ wget https://raw.github.com/jeresig/jquery.hotkeys/master/jquery.hotkeys.js --no-check-certificate && \ wget http://jquery.bassistance.de/treeview/jquery.treeview.zip && \ unzip jquery.treeview.zip -d jquery-treeview && \ rm jquery.treeview.zip && \ wget http://invenio-software.org/download/jquery/v1.5/js/jquery.ajaxPager.js && \ unzip jquery.bookmark.package-1.4.0.zip && \ rm -f jquery.bookmark.ext.* bookmarks-big.png bookmarkBasic.html jquery.bookmark.js jquery.bookmark.pack.js && \ mv bookmarks.png ../img/ && \ mv jquery.bookmark.css ../css/ && \ rm -f jquery.bookmark.package-1.4.0.zip && \ mkdir -p ${prefix}/var/www/img && \ cd ${prefix}/var/www/img && \ wget -r -np -nH --cut-dirs=4 -A "png,css" -P jquery-ui/themes http://jquery-ui.googlecode.com/svn/tags/1.8.17/themes/base/ && \ wget -r -np -nH --cut-dirs=4 -A "png,css" -P jquery-ui/themes http://jquery-ui.googlecode.com/svn/tags/1.8.17/themes/smoothness/ && \ wget -r -np -nH --cut-dirs=4 -A "png,css" -P jquery-ui/themes http://jquery-ui.googlecode.com/svn/tags/1.8.17/themes/redmond/ && \ wget --no-check-certificate -O datatables_jquery-ui.css https://github.com/DataTables/DataTables/raw/master/media/css/demo_table_jui.css && \ wget http://jquery-ui.googlecode.com/svn/tags/1.8.17/themes/redmond/jquery-ui.css && \ wget http://jquery-ui.googlecode.com/svn/tags/1.8.17/demos/images/calendar.gif && \ wget -r -np -nH --cut-dirs=5 -A "png" http://jquery-ui.googlecode.com/svn/tags/1.8.17/themes/redmond/images/) @echo "***********************************************************" @echo "** The jQuery plugins were successfully installed. 
**" @echo "***********************************************************" uninstall-jquery-plugins: (cd ${prefix}/var/www/js && \ rm -f jquery.min.js && \ rm -f jquery.MultiFile.pack.js && \ rm -f jquery.jeditable.mini.js && \ rm -f jquery.flot.selection.min.js && \ rm -f jquery.flot.min.js && \ rm -f excanvas.min.js && \ rm -f jquery-ui-timepicker-addon.min.js && \ rm -f json2.js && \ rm -f jquery.uploadify.min.js && \ rm -rf tablesorter && \ rm -rf jquery-treeview && \ rm -f jquery.ajaxPager.js && \ rm -f jquery.form.js && \ rm -f jquery.dataTables.min.js && \ rm -f ui.core.js && \ rm -f jquery.bookmark.min.js && \ rm -f jquery.hotkeys.js && \ rm -f jquery.tablesorter.min.js && \ rm -f jquery-ui-1.7.3.custom.min.js && \ rm -f jquery.metadata.js && \ rm -f jquery-latest.js && \ rm -f jquery-ui.min.js) (cd ${prefix}/var/www/img && \ rm -f cancel.png uploadify.css uploadify.swf uploadify.allglyphs.swf uploadify.fla && \ rm -f datatables_jquery-ui.css \ rm -f bookmarks.png) && \ (cd ${prefix}/var/www/css && \ rm -f jquery.bookmark.css) @echo "***********************************************************" @echo "** The jquery plugins were successfully uninstalled. **" @echo "***********************************************************" install-ckeditor-plugin: @echo "***********************************************************" @echo "** Installing CKeditor plugin, please wait... **" @echo "***********************************************************" rm -rf ${prefix}/lib/python/invenio/ckeditor/ rm -rf /tmp/invenio-ckeditor-plugin mkdir /tmp/invenio-ckeditor-plugin (cd /tmp/invenio-ckeditor-plugin && \ wget 'http://invenio-software.org/download/ckeditor/$(CKEDITOR)' && \ unzip -u -d ${prefix}/var/www $(CKEDITOR)) && \ find ${prefix}/var/www/ckeditor/ -depth -name '_*' -exec rm -rf {} \; && \ find ${prefix}/var/www/ckeditor/ckeditor* -maxdepth 0 ! -name "ckeditor.js" -exec rm -r {} \; && \ rm -fr /tmp/invenio-ckeditor-plugin @echo "* Installing Invenio-specific CKeditor config..." (cd $(top_srcdir)/modules/webstyle/etc && make install) @echo "***********************************************************" @echo "** The CKeditor plugin was successfully installed. **" @echo "** Please do not forget to properly set the option **" @echo "** CFG_WEBCOMMENT_USE_RICH_TEXT_EDITOR in invenio.conf. **" @echo "***********************************************************" uninstall-ckeditor-plugin: @rm -rvf ${prefix}/var/www/ckeditor @rm -rvf ${prefix}/lib/python/invenio/ckeditor @echo "***********************************************************" @echo "** The CKeditor plugin was successfully uninstalled. **" @echo "***********************************************************" install-pdfa-helper-files: @echo "***********************************************************" @echo "** Installing PDF/A helper files, please wait... **" @echo "***********************************************************" wget 'http://invenio-software.org/download/invenio-demo-site-files/ISOCoatedsb.icc' -O ${prefix}/etc/websubmit/file_converter_templates/ISOCoatedsb.icc @echo "***********************************************************" @echo "** The PDF/A helper files were successfully installed. **" @echo "***********************************************************" install-mediaelement: @echo "***********************************************************" @echo "** MediaElement.js, please wait... 
**" @echo "***********************************************************" rm -rf /tmp/mediaelement mkdir /tmp/mediaelement wget 'http://github.com/johndyer/mediaelement/zipball/master' -O '/tmp/mediaelement/mediaelement.zip' --no-check-certificate unzip -u -d '/tmp/mediaelement' '/tmp/mediaelement/mediaelement.zip' rm -rf ${prefix}/var/www/mediaelement mkdir ${prefix}/var/www/mediaelement mv /tmp/mediaelement/johndyer-mediaelement-*/build/* ${prefix}/var/www/mediaelement rm -rf /tmp/mediaelement @echo "***********************************************************" @echo "** MediaElement.js was successfully installed. **" @echo "***********************************************************" uninstall-pdfa-helper-files: rm -f ${prefix}/etc/websubmit/file_converter_templates/ISOCoatedsb.icc @echo "***********************************************************" @echo "** The PDF/A helper files were successfully uninstalled. **" @echo "***********************************************************" #Solrutils allows automatic installation, running and searching of an external Solr index. install-solrutils: @echo "***********************************************************" @echo "** Installing Solrutils and solr, please wait... **" @echo "***********************************************************" cd $(prefix)/lib && \ if test -d apache-solr*; then echo A solr directory already exists in `pwd` . \ Please remove it manually, if you are sure it is not needed; exit 2; fi ; \ if test -f apache-solr*; then echo solr tarball already exists in `pwd` . \ Please remove it manually.; exit 2; fi ; \ wget http://archive.apache.org/dist/lucene/solr/3.1.0/apache-solr-3.1.0.tgz && \ tar -xzf apache-solr-3.1.0.tgz && \ rm apache-solr-3.1.0.tgz cd $(solrdir)/contrib/ ;\ wget http://mirrors.ibiblio.org/pub/mirrors/maven2/com/jcraft/jzlib/1.0.7/jzlib-1.0.7.jar && \ cd $(solrdir)/contrib/ ;\ jar -xf ../example/webapps/solr.war WEB-INF/lib/lucene-core-3.1.0.jar ; \ if test -d basic-lucene-libs; then rm -rf basic-lucene-libs; fi ; \ mv WEB-INF/lib/ basic-lucene-libs ; \ cp $(solrutils_dir)/schema.xml $(solrdir)/example/solr/conf/ cp $(solrutils_dir)/solrconfig.xml $(solrdir)/example/solr/conf/ cd $(solrutils_dir) && \ javac -classpath $(CLASSPATH) -d $(solrdir)/contrib @$(solrutils_dir)/java_sources.txt && \ cd $(solrdir)/contrib/ && \ jar -cf invenio-solr.jar org/invenio_software/solr/*class update-v0.99.0-tables: cat $(top_srcdir)/modules/miscutil/sql/tabcreate.sql | grep -v 'INSERT INTO upgrade' | ${prefix}/bin/dbexec echo "DROP TABLE IF EXISTS oaiREPOSITORY;" | ${prefix}/bin/dbexec echo "ALTER TABLE bibdoc ADD COLUMN more_info mediumblob NULL default NULL;" | ${prefix}/bin/dbexec echo "ALTER TABLE schTASK ADD COLUMN priority tinyint(4) NOT NULL default 0;" | ${prefix}/bin/dbexec echo "ALTER TABLE schTASK ADD KEY priority (priority);" | ${prefix}/bin/dbexec echo "ALTER TABLE rnkCITATIONDATA DROP PRIMARY KEY;" | ${prefix}/bin/dbexec echo "ALTER TABLE rnkCITATIONDATA ADD PRIMARY KEY (id);" | ${prefix}/bin/dbexec echo "ALTER TABLE rnkCITATIONDATA CHANGE id id mediumint(8) unsigned NOT NULL auto_increment;" | ${prefix}/bin/dbexec echo "ALTER TABLE rnkCITATIONDATA ADD UNIQUE KEY object_name (object_name);" | ${prefix}/bin/dbexec echo "ALTER TABLE sbmPARAMETERS CHANGE value value text NOT NULL default '';" | ${prefix}/bin/dbexec echo "ALTER TABLE sbmAPPROVAL ADD note text NOT NULL default '';" | ${prefix}/bin/dbexec echo "ALTER TABLE hstDOCUMENT CHANGE docsize docsize bigint(15) unsigned NOT NULL;" | ${prefix}/bin/dbexec echo 
"ALTER TABLE cmtACTIONHISTORY CHANGE client_host client_host int(10) unsigned default NULL;" | ${prefix}/bin/dbexec update-v0.99.1-tables: @echo "Nothing to do; table structure did not change between v0.99.1 and v0.99.2." update-v0.99.2-tables: @echo "Nothing to do; table structure did not change between v0.99.2 and v0.99.3." update-v0.99.3-tables: @echo "Nothing to do; table structure did not change between v0.99.3 and v0.99.4." update-v0.99.4-tables: @echo "Nothing to do; table structure did not change between v0.99.4 and v0.99.5." update-v0.99.5-tables: @echo "Nothing to do; table structure did not change between v0.99.5 and v0.99.6." update-v0.99.6-tables: @echo "Nothing to do; table structure did not change between v0.99.6 and v0.99.7." update-v0.99.7-tables: # from v0.99.7 to v1.0.0-rc0 echo "RENAME TABLE oaiARCHIVE TO oaiREPOSITORY;" | ${prefix}/bin/dbexec cat $(top_srcdir)/modules/miscutil/sql/tabcreate.sql | grep -v 'INSERT INTO upgrade' | ${prefix}/bin/dbexec echo "INSERT INTO knwKB (id,name,description,kbtype) SELECT id,name,description,'' FROM fmtKNOWLEDGEBASES;" | ${prefix}/bin/dbexec echo "INSERT INTO knwKBRVAL (id,m_key,m_value,id_knwKB) SELECT id,m_key,m_value,id_fmtKNOWLEDGEBASES FROM fmtKNOWLEDGEBASEMAPPINGS;" | ${prefix}/bin/dbexec echo "ALTER TABLE sbmPARAMETERS CHANGE name name varchar(40) NOT NULL default '';" | ${prefix}/bin/dbexec echo "ALTER TABLE bibdoc CHANGE docname docname varchar(250) COLLATE utf8_bin NOT NULL default 'file';" | ${prefix}/bin/dbexec echo "ALTER TABLE bibdoc CHANGE status status text NOT NULL default '';" | ${prefix}/bin/dbexec echo "ALTER TABLE bibdoc ADD COLUMN text_extraction_date datetime NOT NULL default '0000-00-00';" | ${prefix}/bin/dbexec echo "ALTER TABLE collection DROP COLUMN restricted;" | ${prefix}/bin/dbexec echo "ALTER TABLE schTASK CHANGE host host varchar(255) NOT NULL default '';" | ${prefix}/bin/dbexec echo "ALTER TABLE hstTASK CHANGE host host varchar(255) NOT NULL default '';" | ${prefix}/bin/dbexec echo "ALTER TABLE bib85x DROP INDEX kv, ADD INDEX kv (value(100));" | ${prefix}/bin/dbexec echo "UPDATE clsMETHOD SET location='http://invenio-software.org/download/invenio-demo-site-files/HEP.rdf' WHERE name='HEP' AND location='';" | ${prefix}/bin/dbexec echo "UPDATE clsMETHOD SET location='http://invenio-software.org/download/invenio-demo-site-files/NASA-subjects.rdf' WHERE name='NASA-subjects' AND location='';" | ${prefix}/bin/dbexec echo "UPDATE accACTION SET name='runoairepository', description='run oairepositoryupdater task' WHERE name='runoaiarchive';" | ${prefix}/bin/dbexec echo "UPDATE accACTION SET name='cfgoaiharvest', description='configure OAI Harvest' WHERE name='cfgbibharvest';" | ${prefix}/bin/dbexec echo "ALTER TABLE accARGUMENT CHANGE value value varchar(255);" | ${prefix}/bin/dbexec echo "UPDATE accACTION SET allowedkeywords='doctype,act,categ' WHERE name='submit';" | ${prefix}/bin/dbexec echo "INSERT INTO accARGUMENT(keyword,value) VALUES ('categ','*');" | ${prefix}/bin/dbexec echo "INSERT INTO accROLE_accACTION_accARGUMENT(id_accROLE,id_accACTION,id_accARGUMENT,argumentlistid) SELECT DISTINCT raa.id_accROLE,raa.id_accACTION,accARGUMENT.id,raa.argumentlistid FROM accROLE_accACTION_accARGUMENT as raa JOIN accACTION on id_accACTION=accACTION.id,accARGUMENT WHERE accACTION.name='submit' and accARGUMENT.keyword='categ' and accARGUMENT.value='*';" | ${prefix}/bin/dbexec echo "UPDATE accACTION SET allowedkeywords='name,with_editor_rights' WHERE name='cfgwebjournal';" | ${prefix}/bin/dbexec echo "INSERT INTO 
accARGUMENT(keyword,value) VALUES ('with_editor_rights','yes');" | ${prefix}/bin/dbexec echo "INSERT INTO accROLE_accACTION_accARGUMENT(id_accROLE,id_accACTION,id_accARGUMENT,argumentlistid) SELECT DISTINCT raa.id_accROLE,raa.id_accACTION,accARGUMENT.id,raa.argumentlistid FROM accROLE_accACTION_accARGUMENT as raa JOIN accACTION on id_accACTION=accACTION.id,accARGUMENT WHERE accACTION.name='cfgwebjournal' and accARGUMENT.keyword='with_editor_rights' and accARGUMENT.value='yes';" | ${prefix}/bin/dbexec echo "ALTER TABLE bskEXTREC CHANGE id id int(15) unsigned NOT NULL auto_increment;" | ${prefix}/bin/dbexec echo "ALTER TABLE bskEXTREC ADD external_id int(15) NOT NULL default '0';" | ${prefix}/bin/dbexec echo "ALTER TABLE bskEXTREC ADD collection_id int(15) unsigned NOT NULL default '0';" | ${prefix}/bin/dbexec echo "ALTER TABLE bskEXTREC ADD original_url text;" | ${prefix}/bin/dbexec echo "ALTER TABLE cmtRECORDCOMMENT ADD status char(2) NOT NULL default 'ok';" | ${prefix}/bin/dbexec echo "ALTER TABLE cmtRECORDCOMMENT ADD KEY status (status);" | ${prefix}/bin/dbexec echo "INSERT INTO sbmALLFUNCDESCR VALUES ('Move_Photos_to_Storage','Attach/edit the pictures uploaded with the \"create_photos_manager_interface()\" function');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFIELDDESC VALUES ('Upload_Photos',NULL,'','R',NULL,NULL,NULL,NULL,NULL,'\"\"\"\r\nThis is an example of element that creates a photos upload interface.\r\nClone it, customize it and integrate it into your submission. Then add function \r\n\'Move_Photos_to_Storage\' to your submission functions list, in order for files \r\nuploaded with this interface to be attached to the record. More information in \r\nthe WebSubmit admin guide.\r\n\"\"\"\r\n\r\nfrom invenio.websubmit_functions.ParamFile import ParamFromFile\r\nfrom invenio.websubmit_functions.Move_Photos_to_Storage import read_param_file, create_photos_manager_interface, get_session_id\r\n\r\n# Retrieve session id\r\ntry:\r\n # User info is defined only in MBI/MPI actions...\r\n session_id = get_session_id(None, uid, user_info) \r\nexcept:\r\n session_id = get_session_id(req, uid, {})\r\n\r\n# Retrieve context\r\nindir = curdir.split(\'/\')[-3]\r\ndoctype = curdir.split(\'/\')[-2]\r\naccess = curdir.split(\'/\')[-1]\r\n\r\n# Get the record ID, if any\r\nsysno = ParamFromFile(\"%s/%s\" % (curdir,\'SN\')).strip()\r\n\r\n\"\"\"\r\nModify below the configuration of the photos manager interface.\r\nNote: \'can_reorder_photos\' parameter is not yet fully taken into consideration\r\n\r\nDocumentation of the function is available by running:\r\necho -e \'from invenio.websubmit_functions.Move_Photos_to_Storage import create_photos_manager_interface as f\\nprint f.__doc__\' | python\r\n\"\"\"\r\ntext += create_photos_manager_interface(sysno, session_id, uid,\r\n doctype, indir, curdir, access,\r\n can_delete_photos=True,\r\n can_reorder_photos=True,\r\n can_upload_photos=True,\r\n editor_width=700,\r\n editor_height=400,\r\n initial_slider_value=100,\r\n max_slider_value=200,\r\n min_slider_value=80)','0000-00-00','0000-00-00',NULL,NULL,0);" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Move_Photos_to_Storage','iconsize');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFIELDDESC VALUES ('Upload_Files',NULL,'','R',NULL,NULL,NULL,NULL,NULL,'\"\"\"\r\nThis is an example of element that creates a file upload interface.\r\nClone it, customize it and integrate it into your submission. 
Then add function \r\n\'Move_Uploaded_Files_to_Storage\' to your submission functions list, in order for files \r\nuploaded with this interface to be attached to the record. More information in \r\nthe WebSubmit admin guide.\r\n\"\"\"\r\nfrom invenio.websubmit_managedocfiles import create_file_upload_interface\r\nfrom invenio.websubmit_functions.Shared_Functions import ParamFromFile\r\n\r\nindir = ParamFromFile(os.path.join(curdir, \'indir\'))\r\ndoctype = ParamFromFile(os.path.join(curdir, \'doctype\'))\r\naccess = ParamFromFile(os.path.join(curdir, \'access\'))\r\ntry:\r\n sysno = int(ParamFromFile(os.path.join(curdir, \'SN\')).strip())\r\nexcept:\r\n sysno = -1\r\nln = ParamFromFile(os.path.join(curdir, \'ln\'))\r\n\r\n\"\"\"\r\nRun the following to get the list of parameters of function \'create_file_upload_interface\':\r\necho -e \'from invenio.websubmit_managedocfiles import create_file_upload_interface as f\\nprint f.__doc__\' | python\r\n\"\"\"\r\ntext = create_file_upload_interface(recid=sysno,\r\n print_outside_form_tag=False,\r\n include_headers=True,\r\n ln=ln,\r\n doctypes_and_desc=[(\'main\',\'Main document\'),\r\n (\'additional\',\'Figure, schema, etc.\')],\r\n can_revise_doctypes=[\'*\'],\r\n can_describe_doctypes=[\'main\'],\r\n can_delete_doctypes=[\'additional\'],\r\n can_rename_doctypes=[\'main\'],\r\n sbm_indir=indir, sbm_doctype=doctype, sbm_access=access)[1]\r\n','0000-00-00','0000-00-00',NULL,NULL,0);" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Move_Uploaded_Files_to_Storage','forceFileRevision');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmALLFUNCDESCR VALUES ('Create_Upload_Files_Interface','Display generic interface to add/revise/delete files. To be used before function \"Move_Uploaded_Files_to_Storage\"');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmALLFUNCDESCR VALUES ('Move_Uploaded_Files_to_Storage','Attach files uploaded with \"Create_Upload_Files_Interface\"')" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Move_Revised_Files_to_Storage','elementNameToDoctype');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Move_Revised_Files_to_Storage','createIconDoctypes');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Move_Revised_Files_to_Storage','createRelatedFormats');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Move_Revised_Files_to_Storage','iconsize');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Move_Revised_Files_to_Storage','keepPreviousVersionDoctypes');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmALLFUNCDESCR VALUES ('Move_Revised_Files_to_Storage','Revise files initially uploaded with \"Move_Files_to_Storage\"')" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Create_Upload_Files_Interface','maxsize');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Create_Upload_Files_Interface','minsize');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Create_Upload_Files_Interface','doctypes');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Create_Upload_Files_Interface','restrictions');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Create_Upload_Files_Interface','canDeleteDoctypes');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Create_Upload_Files_Interface','canReviseDoctypes');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Create_Upload_Files_Interface','canDescribeDoctypes');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES 
('Create_Upload_Files_Interface','canCommentDoctypes');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Create_Upload_Files_Interface','canKeepDoctypes');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Create_Upload_Files_Interface','canAddFormatDoctypes');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Create_Upload_Files_Interface','canRestrictDoctypes');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Create_Upload_Files_Interface','canRenameDoctypes');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Create_Upload_Files_Interface','canNameNewFiles');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Create_Upload_Files_Interface','createRelatedFormats');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Create_Upload_Files_Interface','keepDefault');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Create_Upload_Files_Interface','showLinks');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Create_Upload_Files_Interface','fileLabel');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Create_Upload_Files_Interface','filenameLabel');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Create_Upload_Files_Interface','descriptionLabel');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Create_Upload_Files_Interface','commentLabel');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Create_Upload_Files_Interface','restrictionLabel');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Create_Upload_Files_Interface','startDoc');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Create_Upload_Files_Interface','endDoc');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Create_Upload_Files_Interface','defaultFilenameDoctypes');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Create_Upload_Files_Interface','maxFilesDoctypes');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Move_Uploaded_Files_to_Storage','iconsize');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Move_Uploaded_Files_to_Storage','createIconDoctypes');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Report_Number_Generation','nblength');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Second_Report_Number_Generation','2nd_nb_length');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Get_Recid','record_search_pattern');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmALLFUNCDESCR VALUES ('Move_FCKeditor_Files_to_Storage','Transfer files attached to the record with the FCKeditor');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Move_FCKeditor_Files_to_Storage','input_fields');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Stamp_Uploaded_Files','layer');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Stamp_Replace_Single_File_Approval','layer');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Stamp_Replace_Single_File_Approval','switch_file');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Stamp_Uploaded_Files','switch_file');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Move_Files_to_Storage','paths_and_restrictions');" | ${prefix}/bin/dbexec echo "INSERT INTO sbmFUNDESC VALUES ('Move_Files_to_Storage','paths_and_doctypes');" | ${prefix}/bin/dbexec echo "ALTER TABLE cmtRECORDCOMMENT ADD round_name varchar(255) NOT NULL default ''" | ${prefix}/bin/dbexec echo 
"ALTER TABLE cmtRECORDCOMMENT ADD restriction varchar(50) NOT NULL default ''" | ${prefix}/bin/dbexec echo "ALTER TABLE cmtRECORDCOMMENT ADD in_reply_to_id_cmtRECORDCOMMENT int(15) unsigned NOT NULL default '0'" | ${prefix}/bin/dbexec echo "ALTER TABLE cmtRECORDCOMMENT ADD KEY in_reply_to_id_cmtRECORDCOMMENT (in_reply_to_id_cmtRECORDCOMMENT);" | ${prefix}/bin/dbexec echo "ALTER TABLE bskRECORDCOMMENT ADD in_reply_to_id_bskRECORDCOMMENT int(15) unsigned NOT NULL default '0'" | ${prefix}/bin/dbexec echo "ALTER TABLE bskRECORDCOMMENT ADD KEY in_reply_to_id_bskRECORDCOMMENT (in_reply_to_id_bskRECORDCOMMENT);" | ${prefix}/bin/dbexec echo "ALTER TABLE cmtRECORDCOMMENT ADD reply_order_cached_data blob NULL default NULL;" | ${prefix}/bin/dbexec echo "ALTER TABLE bskRECORDCOMMENT ADD reply_order_cached_data blob NULL default NULL;" | ${prefix}/bin/dbexec echo "ALTER TABLE cmtRECORDCOMMENT ADD INDEX (reply_order_cached_data(40));" | ${prefix}/bin/dbexec echo "ALTER TABLE bskRECORDCOMMENT ADD INDEX (reply_order_cached_data(40));" | ${prefix}/bin/dbexec echo -e 'from invenio.webcommentadminlib import migrate_comments_populate_threads_index;\ migrate_comments_populate_threads_index()' | $(PYTHON) echo -e 'from invenio.access_control_firerole import repair_role_definitions;\ repair_role_definitions()' | $(PYTHON) CLEANFILES = *~ *.pyc *.tmp diff --git a/configure-tests.py b/configure-tests.py index 0c729575a..1d686f9ce 100644 --- a/configure-tests.py +++ b/configure-tests.py @@ -1,470 +1,473 @@ ## This file is part of Invenio. ## Copyright (C) 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012 CERN. ## ## Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. """ Test the suitability of Python core and the availability of various Python modules for running Invenio. Warn the user if there are eventual troubles. Exit status: 0 if okay, 1 if not okay. Useful for running from configure.ac. """ ## minimally recommended/required versions: cfg_min_python_version = "2.4" cfg_max_python_version = "2.9.9999" cfg_min_mysqldb_version = "1.2.1_p2" ## 0) import modules needed for this testing: import string import sys import getpass import subprocess import re error_messages = [] warning_messages = [] def wait_for_user(msg): """Print MSG and prompt user for confirmation.""" try: raw_input(msg) except KeyboardInterrupt: print "\n\nInstallation aborted." sys.exit(1) except EOFError: print " (continuing in batch mode)" return ## 1) check Python version: if sys.version < cfg_min_python_version: error_messages.append( """ ******************************************************* ** ERROR: TOO OLD PYTHON DETECTED: %s ******************************************************* ** You seem to be using a too old version of Python. ** ** You must use at least Python %s. 
** ** ** ** Note that if you have more than one Python ** ** installed on your system, you can specify the ** ** --with-python configuration option to choose ** ** a specific (e.g. non system wide) Python binary. ** ** ** ** Please upgrade your Python before continuing. ** ******************************************************* """ % (string.replace(sys.version, "\n", ""), cfg_min_python_version) ) if sys.version > cfg_max_python_version: error_messages.append( """ ******************************************************* ** ERROR: TOO NEW PYTHON DETECTED: %s ******************************************************* ** You seem to be using a too new version of Python. ** ** You must use at most Python %s. ** ** ** ** Perhaps you have downloaded and are installing an ** ** old Invenio version? Please look for more recent ** ** Invenio version or please contact the development ** ** team at about this ** ** problem. ** ** ** ** Installation aborted. ** ******************************************************* """ % (string.replace(sys.version, "\n", ""), cfg_max_python_version) ) ## 2) check for required modules: try: import MySQLdb import base64 import cPickle import cStringIO import cgi import copy import fileinput import getopt import sys if sys.hexversion < 0x2060000: import md5 else: import hashlib import marshal import os import signal import tempfile import time import traceback import unicodedata import urllib import zlib import wsgiref except ImportError, msg: error_messages.append(""" ************************************************* ** IMPORT ERROR %s ************************************************* ** Perhaps you forgot to install some of the ** ** prerequisite Python modules? Please look ** ** at our INSTALL file for more details and ** ** fix the problem before continuing! ** ************************************************* """ % msg ) ## 3) check for recommended modules: try: import rdflib except ImportError, msg: warning_messages.append( """ ***************************************************** ** IMPORT WARNING %s ***************************************************** ** Note that rdflib is needed only if you plan ** ** to work with the automatic classification of ** ** documents based on RDF-based taxonomies. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) ** ***************************************************** """ % msg ) try: import pyRXP except ImportError, msg: warning_messages.append(""" ***************************************************** ** IMPORT WARNING %s ***************************************************** ** Note that PyRXP is not really required but ** ** we recommend it for fast XML MARC parsing. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) ** ***************************************************** """ % msg ) try: import dateutil except ImportError, msg: warning_messages.append(""" ***************************************************** ** IMPORT WARNING %s ***************************************************** ** Note that dateutil is not really required but ** ** we recommend it for user-friendly date ** ** parsing. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) 
** ***************************************************** """ % msg ) try: import libxml2 except ImportError, msg: warning_messages.append(""" ***************************************************** ** IMPORT WARNING %s ***************************************************** ** Note that libxml2 is not really required but ** ** we recommend it for XML metadata conversions ** ** and for fast XML parsing. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) ** ***************************************************** """ % msg ) try: import libxslt except ImportError, msg: warning_messages.append( """ ***************************************************** ** IMPORT WARNING %s ***************************************************** ** Note that libxslt is not really required but ** ** we recommend it for XML metadata conversions. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) ** ***************************************************** """ % msg ) try: import Gnuplot except ImportError, msg: warning_messages.append( """ ***************************************************** ** IMPORT WARNING %s ***************************************************** ** Note that Gnuplot.py is not really required but ** ** we recommend it in order to have nice download ** ** and citation history graphs on Detailed record ** ** pages. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) ** ***************************************************** """ % msg ) try: import magic if not hasattr(magic, "open"): raise StandardError except ImportError, msg: warning_messages.append( """ ***************************************************** ** IMPORT WARNING %s ***************************************************** ** Note that magic module is not really required ** ** but we recommend it in order to have detailed ** ** content information about fulltext files. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) ** ***************************************************** """ % msg ) except StandardError: warning_messages.append( """ ***************************************************** ** IMPORT WARNING python-magic ***************************************************** ** The python-magic package you installed is not ** ** the one supported by Invenio. Please refer to ** ** the INSTALL file for more details. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) ** ***************************************************** """ ) try: import reportlab except ImportError, msg: warning_messages.append( """ ***************************************************** ** IMPORT WARNING %s ***************************************************** ** Note that reportlab module is not really ** ** required, but we recommend it you want to ** ** enrich PDF with OCR information. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. 
** ** even after your Invenio installation is put ** ** into production.) ** ***************************************************** """ % msg ) try: - import pyPdf + try: + import PyPDF2 + except ImportError: + import pyPdf except ImportError, msg: warning_messages.append( """ ***************************************************** ** IMPORT WARNING %s ***************************************************** - ** Note that pyPdf module is not really ** + ** Note that pyPdf or pyPdf2 module is not really ** ** required, but we recommend it you want to ** ** enrich PDF with OCR information. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) ** ***************************************************** """ % msg ) ## 4) check for versions of some important modules: if MySQLdb.__version__ < cfg_min_mysqldb_version: error_messages.append( """ ***************************************************** ** ERROR: PYTHON MODULE MYSQLDB %s DETECTED ***************************************************** ** You have to upgrade your MySQLdb to at least ** ** version %s. You must fix this problem ** ** before continuing. Please see the INSTALL file ** ** for more details. ** ***************************************************** """ % (MySQLdb.__version__, cfg_min_mysqldb_version) ) try: import Stemmer try: from Stemmer import algorithms except ImportError, msg: error_messages.append( """ ***************************************************** ** ERROR: STEMMER MODULE PROBLEM %s ***************************************************** ** Perhaps you are using an old Stemmer version? ** ** You must either remove your old Stemmer or else ** ** upgrade to Snowball Stemmer ** ** before continuing. Please see the INSTALL file ** ** for more details. ** ***************************************************** """ % (msg) ) except ImportError: pass # no prob, Stemmer is optional ## 5) check for Python.h (needed for intbitset): try: from distutils.sysconfig import get_python_inc path_to_python_h = get_python_inc() + os.sep + 'Python.h' if not os.path.exists(path_to_python_h): raise StandardError, "Cannot find %s" % path_to_python_h except StandardError, msg: error_messages.append( """ ***************************************************** ** ERROR: PYTHON HEADER FILE ERROR %s ***************************************************** ** You do not seem to have Python developer files ** ** installed (such as Python.h). Some operating ** ** systems provide these in a separate Python ** ** package called python-dev or python-devel. ** ** You must install such a package before ** ** continuing the installation process. ** ***************************************************** """ % (msg) ) ## Check if ffmpeg is installed and if so, with the minimum configuration for bibencode try: try: process = subprocess.Popen('ffprobe', stderr=subprocess.PIPE, stdout=subprocess.PIPE) except OSError: raise StandardError, "FFMPEG/FFPROBE does not seem to be installed!" 
returncode = process.wait() output = process.communicate()[1] RE_CONFIGURATION = re.compile("(--enable-[a-z0-9\-]*)") CONFIGURATION_REQUIRED = ( '--enable-gpl', '--enable-version3', '--enable-nonfree', '--enable-libtheora', '--enable-libvorbis', '--enable-libvpx', '--enable-libopenjpeg' ) options = RE_CONFIGURATION.findall(output) if sys.version_info < (2, 6): import sets s = sets.Set(CONFIGURATION_REQUIRED) if not s.issubset(options): raise StandardError, options.difference(s) else: if not set(CONFIGURATION_REQUIRED).issubset(options): raise StandardError, set(CONFIGURATION_REQUIRED).difference(options) except StandardError, msg: warning_messages.append( """ ***************************************************** ** WARNING: FFMPEG CONFIGURATION MISSING %s ***************************************************** ** You do not seem to have FFmpeg configured with ** ** the minimum video codecs to run the demo site. ** ** Please install the necessary libraries and ** ** re-install FFmpeg according to the Invenio ** ** installation manual (INSTALL). ** ***************************************************** """ % (msg) ) if warning_messages: print """ ****************************************************** ** WARNING MESSAGES ** ****************************************************** """ for warning in warning_messages: print warning if error_messages: print """ ****************************************************** ** ERROR MESSAGES ** ****************************************************** """ for error in error_messages: print error if warning_messages and error_messages: print """ There were %(n_err)s error(s) found that you need to solve. Please see above, solve them, and re-run configure. Note that there are also %(n_wrn)s warnings you may want to look into. Aborting the installation. """ % {'n_wrn': len(warning_messages), 'n_err': len(error_messages)} sys.exit(1) elif error_messages: print """ There were %(n_err)s error(s) found that you need to solve. Please see above, solve them, and re-run configure. Aborting the installation. """ % {'n_err': len(error_messages)} sys.exit(1) elif warning_messages: print """ There were %(n_wrn)s warnings found that you may want to look into, solve, and re-run configure before you continue the installation. However, you can also continue the installation now and solve these issues later, if you wish. """ % {'n_wrn': len(warning_messages)} wait_for_user("Press ENTER to continue the installation...") diff --git a/modules/bibindex/lib/bibindex_engine.py b/modules/bibindex/lib/bibindex_engine.py index eda31580c..48f3110b1 100644 --- a/modules/bibindex/lib/bibindex_engine.py +++ b/modules/bibindex/lib/bibindex_engine.py @@ -1,1719 +1,1732 @@ # -*- coding: utf-8 -*- ## ## This file is part of Invenio. -## Copyright (C) 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012 CERN. +## Copyright (C) 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013 CERN. ## ## Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. 
## ## You should have received a copy of the GNU General Public License ## along with Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. """ BibIndex indexing engine implementation. See bibindex executable for entry point. """ __revision__ = "$Id$" import os import re import sys import time import urllib2 import logging from invenio.config import \ CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS, \ CFG_BIBINDEX_CHARS_PUNCTUATION, \ CFG_BIBINDEX_FULLTEXT_INDEX_LOCAL_FILES_ONLY, \ CFG_BIBINDEX_AUTHOR_WORD_INDEX_EXCLUDE_FIRST_NAMES, \ CFG_BIBINDEX_SYNONYM_KBRS, \ CFG_CERN_SITE, CFG_INSPIRE_SITE, \ CFG_BIBINDEX_SPLASH_PAGES, \ CFG_SOLR_URL, \ CFG_XAPIAN_ENABLED from invenio.bibindex_engine_config import CFG_MAX_MYSQL_THREADS, \ CFG_MYSQL_THREAD_TIMEOUT, \ CFG_CHECK_MYSQL_THREADS from invenio.bibindex_engine_tokenizer import \ BibIndexFuzzyNameTokenizer, BibIndexExactNameTokenizer, \ BibIndexPairTokenizer, BibIndexWordTokenizer, \ BibIndexPhraseTokenizer from invenio.bibindexadminlib import get_idx_indexer from invenio.bibdocfile import bibdocfile_url_p, \ bibdocfile_url_to_bibdoc, normalize_format, \ download_url, guess_format_from_url, BibRecDocs, \ decompose_bibdocfile_url from invenio.websubmit_file_converter import convert_file, get_file_converter_logger from invenio.search_engine import perform_request_search, \ get_index_stemming_language, \ get_synonym_terms from invenio.dbquery import run_sql, DatabaseError, serialize_via_marshal, \ deserialize_via_marshal, wash_table_column_name from invenio.bibindex_engine_washer import wash_index_term from invenio.bibtask import task_init, write_message, get_datetime, \ task_set_option, task_get_option, task_get_task_param, \ task_update_progress, task_sleep_now_if_required from invenio.intbitset import intbitset from invenio.errorlib import register_exception from invenio.htmlutils import get_links_in_html_page from invenio.search_engine_utils import get_fieldvalues from invenio.solrutils_bibindex_indexer import solr_add_fulltext, solr_commit from invenio.xapianutils_bibindex_indexer import xapian_add +from invenio.bibrankadminlib import get_def_name if sys.hexversion < 0x2040000: # pylint: disable=W0622 from sets import Set as set # pylint: enable=W0622 # FIXME: journal tag and journal pubinfo standard format are defined here: if CFG_CERN_SITE: CFG_JOURNAL_TAG = '773__%' CFG_JOURNAL_PUBINFO_STANDARD_FORM = "773__p 773__v (773__y) 773__c" CFG_JOURNAL_PUBINFO_STANDARD_FORM_REGEXP_CHECK = r'^\w.*\s\w.*\s\(\d+\)\s\w.*$' elif CFG_INSPIRE_SITE: CFG_JOURNAL_TAG = '773__%' CFG_JOURNAL_PUBINFO_STANDARD_FORM = "773__p,773__v,773__c" CFG_JOURNAL_PUBINFO_STANDARD_FORM_REGEXP_CHECK = r'^\w.*,\w.*,\w.*$' else: CFG_JOURNAL_TAG = '909C4%' CFG_JOURNAL_PUBINFO_STANDARD_FORM = "909C4p 909C4v (909C4y) 909C4c" CFG_JOURNAL_PUBINFO_STANDARD_FORM_REGEXP_CHECK = r'^\w.*\s\w.*\s\(\d+\)\s\w.*$' ## precompile some often-used regexp for speed reasons: re_subfields = re.compile('\$\$\w') re_block_punctuation_begin = re.compile(r"^" + CFG_BIBINDEX_CHARS_PUNCTUATION + "+") re_block_punctuation_end = re.compile(CFG_BIBINDEX_CHARS_PUNCTUATION + "+$") re_punctuation = re.compile(CFG_BIBINDEX_CHARS_PUNCTUATION) re_separators = re.compile(CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS) re_datetime_shift = re.compile("([-\+]{0,1})([\d]+)([dhms])") re_arxiv = re.compile(r'^arxiv:\d\d\d\d\.\d\d\d\d') nb_char_in_line = 50 # for verbose pretty printing chunksize = 1000 # default size of chunks that the records will be treated by 
base_process_size = 4500 # process base size _last_word_table = None fulltext_added = intbitset() # stores ids of records whose fulltexts have been added def list_union(list1, list2): "Returns union of the two lists." union_dict = {} for e in list1: union_dict[e] = 1 for e in list2: union_dict[e] = 1 return union_dict.keys() ## safety function for killing slow DB threads: def kill_sleepy_mysql_threads(max_threads=CFG_MAX_MYSQL_THREADS, thread_timeout=CFG_MYSQL_THREAD_TIMEOUT): """Check the number of DB threads and if there are more than MAX_THREADS of them, kill all threads that are in a sleeping state for more than THREAD_TIMEOUT seconds. (This is useful for working around the max_connection problem that appears during indexation in some not-yet-understood cases.) If some threads are to be killed, write info into the log file. """ res = run_sql("SHOW FULL PROCESSLIST") if len(res) > max_threads: for row in res: r_id, dummy, dummy, dummy, r_command, r_time, dummy, dummy = row if r_command == "Sleep" and int(r_time) > thread_timeout: run_sql("KILL %s", (r_id,)) write_message("WARNING: too many DB threads, killing thread %s" % r_id, verbose=1) return def get_associated_subfield_value(recID, tag, value, associated_subfield_code): """Return list of ASSOCIATED_SUBFIELD_CODE, if exists, for record RECID and TAG of value VALUE. Used by fulltext indexer only. Note: TAG must be 6 characters long (tag+ind1+ind2+sfcode), otherwise an empty string is returned. FIXME: what if many tag values have the same value but different associated_subfield_code? Better use bibrecord library for this. """ out = "" if len(tag) != 6: return out bibXXx = "bib" + tag[0] + tag[1] + "x" bibrec_bibXXx = "bibrec_" + bibXXx query = """SELECT bb.field_number, b.tag, b.value FROM %s AS b, %s AS bb WHERE bb.id_bibrec=%%s AND bb.id_bibxxx=b.id AND tag LIKE %%s%%""" % (bibXXx, bibrec_bibXXx) res = run_sql(query, (recID, tag[:-1])) field_number = -1 for row in res: if row[1] == tag and row[2] == value: field_number = row[0] if field_number > 0: for row in res: if row[0] == field_number and row[1] == tag[:-1] + associated_subfield_code: out = row[2] break return out def get_field_tags(field): """Returns a list of MARC tags for the field code 'field'. Returns empty list in case of error. Example: field='author', output=['100__%','700__%'].""" out = [] query = """SELECT t.value FROM tag AS t, field_tag AS ft, field AS f WHERE f.code=%s AND ft.id_field=f.id AND t.id=ft.id_tag ORDER BY ft.score DESC""" res = run_sql(query, (field,)) return [row[0] for row in res] def get_words_from_journal_tag(recID, tag): """ Special procedure to extract words from journal tags. Joins title/volume/year/page into a standard form that is also used for citations.
""" # get all journal tags/subfields: bibXXx = "bib" + tag[0] + tag[1] + "x" bibrec_bibXXx = "bibrec_" + bibXXx query = """SELECT bb.field_number,b.tag,b.value FROM %s AS b, %s AS bb WHERE bb.id_bibrec=%%s AND bb.id_bibxxx=b.id AND tag LIKE %%s""" % (bibXXx, bibrec_bibXXx) res = run_sql(query, (recID, tag)) # construct journal pubinfo: dpubinfos = {} for row in res: nb_instance, subfield, value = row if subfield.endswith("c"): # delete pageend if value is pagestart-pageend # FIXME: pages may not be in 'c' subfield value = value.split('-', 1)[0] if dpubinfos.has_key(nb_instance): dpubinfos[nb_instance][subfield] = value else: dpubinfos[nb_instance] = {subfield: value} # construct standard format: lwords = [] for dpubinfo in dpubinfos.values(): # index all journal subfields separately for tag, val in dpubinfo.items(): lwords.append(val) # index journal standard format: pubinfo = CFG_JOURNAL_PUBINFO_STANDARD_FORM for tag, val in dpubinfo.items(): pubinfo = pubinfo.replace(tag, val) if CFG_JOURNAL_TAG[:-1] in pubinfo: # some subfield was missing, do nothing pass else: lwords.append(pubinfo) # return list of words and pubinfos: return lwords def get_field_count(recID, tags): """ Return number of field instances having TAGS in record RECID. @param recID: record ID @type recID: int @param tags: list of tags to count, e.g. ['100__a', '700__a'] @type tags: list @return: number of tags present in record @rtype: int @note: Works internally via getting field values, which may not be very efficient. Could use counts only, or else retrieve stored recstruct format of the record and walk through it. """ out = 0 for tag in tags: out += len(get_fieldvalues(recID, tag)) return out def get_author_canonical_ids_for_recid(recID): """ Return list of author canonical IDs (e.g. `J.Ellis.1') for the given record. Done by consulting BibAuthorID module. """ from invenio.bibauthorid_dbinterface import get_persons_from_recids lwords = [] res = get_persons_from_recids([recID]) if res is None: ## BibAuthorID is not enabled return lwords else: dpersons, dpersoninfos = res for aid in dpersoninfos.keys(): author_canonical_id = dpersoninfos[aid].get('canonical_id', '') if author_canonical_id: lwords.append(author_canonical_id) return lwords def get_words_from_date_tag(datestring, stemming_language=None): """ Special procedure to index words from tags storing date-like information in format YYYY or YYYY-MM or YYYY-MM-DD. Namely, we are indexing word-terms YYYY, YYYY-MM, YYYY-MM-DD, but never standalone MM or DD. """ out = [] for dateword in datestring.split(): # maybe there are whitespaces, so break these too out.append(dateword) parts = dateword.split('-') for nb in range(1, len(parts)): out.append("-".join(parts[:nb])) return out def get_words_from_fulltext(url_direct_or_indirect, stemming_language=None): """Returns all the words contained in the document specified by URL_DIRECT_OR_INDIRECT with the words being split by various SRE_SEPARATORS regexp set earlier. If FORCE_FILE_EXTENSION is set (e.g. to "pdf", then treat URL_DIRECT_OR_INDIRECT as a PDF file. (This is interesting to index Indico for example.) Note also that URL_DIRECT_OR_INDIRECT may be either a direct URL to the fulltext file or an URL to a setlink-like page body that presents the links to be indexed. In the latter case the URL_DIRECT_OR_INDIRECT is parsed to extract actual direct URLs to fulltext documents, for all knows file extensions as specified by global CONV_PROGRAMS config variable. """ write_message("... 
reading fulltext files from %s started" % url_direct_or_indirect, verbose=2) try: if bibdocfile_url_p(url_direct_or_indirect): write_message("... %s is an internal document" % url_direct_or_indirect, verbose=2) bibdoc = bibdocfile_url_to_bibdoc(url_direct_or_indirect) indexer = get_idx_indexer('fulltext') if indexer != 'native': # A document might belong to multiple records for rec_link in bibdoc.bibrec_links: recid = rec_link["recid"] # Adds fulltexts of all files once per records if not recid in fulltext_added: bibrecdocs = BibRecDocs(recid) text = bibrecdocs.get_text() if indexer == 'SOLR' and CFG_SOLR_URL: solr_add_fulltext(recid, text) elif indexer == 'XAPIAN' and CFG_XAPIAN_ENABLED: xapian_add(recid, 'fulltext', text) fulltext_added.add(recid) # we are relying on an external information retrieval system # to provide full-text indexing, so dispatch text to it and # return nothing here: return [] else: text = "" if hasattr(bibdoc, "get_text"): text = bibdoc.get_text() return get_words_from_phrase(text, stemming_language) else: if CFG_BIBINDEX_FULLTEXT_INDEX_LOCAL_FILES_ONLY: write_message("... %s is external URL but indexing only local files" % url_direct_or_indirect, verbose=2) return [] write_message("... %s is an external URL" % url_direct_or_indirect, verbose=2) urls_to_index = set() for splash_re, url_re in CFG_BIBINDEX_SPLASH_PAGES.iteritems(): if re.match(splash_re, url_direct_or_indirect): write_message("... %s is a splash page (%s)" % (url_direct_or_indirect, splash_re), verbose=2) html = urllib2.urlopen(url_direct_or_indirect).read() urls = get_links_in_html_page(html) write_message("... found these URLs in %s splash page: %s" % (url_direct_or_indirect, ", ".join(urls)), verbose=3) for url in urls: if re.match(url_re, url): write_message("... will index %s (matched by %s)" % (url, url_re), verbose=2) urls_to_index.add(url) if not urls_to_index: urls_to_index.add(url_direct_or_indirect) write_message("... 
will extract words from %s" % ', '.join(urls_to_index), verbose=2) words = {} for url in urls_to_index: tmpdoc = download_url(url) file_converter_logger = get_file_converter_logger() old_logging_level = file_converter_logger.getEffectiveLevel() if task_get_task_param("verbose") > 3: file_converter_logger.setLevel(logging.DEBUG) try: try: tmptext = convert_file(tmpdoc, output_format='.txt') text = open(tmptext).read() os.remove(tmptext) indexer = get_idx_indexer('fulltext') if indexer != 'native': if indexer == 'SOLR' and CFG_SOLR_URL: solr_add_fulltext(None, text) # FIXME: use real record ID if indexer == 'XAPIAN' and CFG_XAPIAN_ENABLED: #xapian_add(None, 'fulltext', text) # FIXME: use real record ID pass # we are relying on an external information retrieval system # to provide full-text indexing, so dispatch text to it and # return nothing here: tmpwords = [] else: tmpwords = get_words_from_phrase(text, stemming_language) words.update(dict(map(lambda x: (x, 1), tmpwords))) except Exception, e: message = 'ERROR: it\'s impossible to correctly extract words from %s referenced by %s: %s' % (url, url_direct_or_indirect, e) register_exception(prefix=message, alert_admin=True) write_message(message, stream=sys.stderr) finally: os.remove(tmpdoc) if task_get_task_param("verbose") > 3: file_converter_logger.setLevel(old_logging_level) return words.keys() except Exception, e: message = 'ERROR: it\'s impossible to correctly extract words from %s: %s' % (url_direct_or_indirect, e) register_exception(prefix=message, alert_admin=True) write_message(message, stream=sys.stderr) return [] def get_nothing_from_phrase(phrase, stemming_language=None): """ A dummy implementation of get_words_from_phrase to be used when a tag should not be indexed (such as when trying to extract phrases from 8564_u).""" return [] def swap_temporary_reindex_tables(index_id, reindex_prefix="tmp_"): """Atomically swap reindexed temporary table with the original one.
Delete the now-old one.""" write_message("Putting new tmp index tables for id %s into production" % index_id) run_sql( "RENAME TABLE " + "idxWORD%02dR TO old_idxWORD%02dR," % (index_id, index_id) + "%sidxWORD%02dR TO idxWORD%02dR," % (reindex_prefix, index_id, index_id) + "idxWORD%02dF TO old_idxWORD%02dF," % (index_id, index_id) + "%sidxWORD%02dF TO idxWORD%02dF," % (reindex_prefix, index_id, index_id) + "idxPAIR%02dR TO old_idxPAIR%02dR," % (index_id, index_id) + "%sidxPAIR%02dR TO idxPAIR%02dR," % (reindex_prefix, index_id, index_id) + "idxPAIR%02dF TO old_idxPAIR%02dF," % (index_id, index_id) + "%sidxPAIR%02dF TO idxPAIR%02dF," % (reindex_prefix, index_id, index_id) + "idxPHRASE%02dR TO old_idxPHRASE%02dR," % (index_id, index_id) + "%sidxPHRASE%02dR TO idxPHRASE%02dR," % (reindex_prefix, index_id, index_id) + "idxPHRASE%02dF TO old_idxPHRASE%02dF," % (index_id, index_id) + "%sidxPHRASE%02dF TO idxPHRASE%02dF;" % (reindex_prefix, index_id, index_id) ) write_message("Dropping old index tables for id %s" % index_id) run_sql("DROP TABLE old_idxWORD%02dR, old_idxWORD%02dF, old_idxPAIR%02dR, old_idxPAIR%02dF, old_idxPHRASE%02dR, old_idxPHRASE%02dF" % (index_id, index_id, index_id, index_id, index_id, index_id)) # kwalitee: disable=sql def init_temporary_reindex_tables(index_id, reindex_prefix="tmp_"): """Create reindexing temporary tables.""" write_message("Creating new tmp index tables for id %s" % index_id) run_sql("""DROP TABLE IF EXISTS %sidxWORD%02dF""" % (wash_table_column_name(reindex_prefix), index_id)) # kwalitee: disable=sql run_sql("""CREATE TABLE %sidxWORD%02dF ( id mediumint(9) unsigned NOT NULL auto_increment, term varchar(50) default NULL, hitlist longblob, PRIMARY KEY (id), UNIQUE KEY term (term) ) ENGINE=MyISAM""" % (reindex_prefix, index_id)) run_sql("""DROP TABLE IF EXISTS %sidxWORD%02dR""" % (wash_table_column_name(reindex_prefix), index_id)) # kwalitee: disable=sql run_sql("""CREATE TABLE %sidxWORD%02dR ( id_bibrec mediumint(9) unsigned NOT NULL, termlist longblob, type enum('CURRENT','FUTURE','TEMPORARY') NOT NULL default 'CURRENT', PRIMARY KEY (id_bibrec,type) ) ENGINE=MyISAM""" % (reindex_prefix, index_id)) run_sql("""DROP TABLE IF EXISTS %sidxPAIR%02dF""" % (wash_table_column_name(reindex_prefix), index_id)) # kwalitee: disable=sql run_sql("""CREATE TABLE %sidxPAIR%02dF ( id mediumint(9) unsigned NOT NULL auto_increment, term varchar(100) default NULL, hitlist longblob, PRIMARY KEY (id), UNIQUE KEY term (term) ) ENGINE=MyISAM""" % (reindex_prefix, index_id)) run_sql("""DROP TABLE IF EXISTS %sidxPAIR%02dR""" % (wash_table_column_name(reindex_prefix), index_id)) # kwalitee: disable=sql run_sql("""CREATE TABLE %sidxPAIR%02dR ( id_bibrec mediumint(9) unsigned NOT NULL, termlist longblob, type enum('CURRENT','FUTURE','TEMPORARY') NOT NULL default 'CURRENT', PRIMARY KEY (id_bibrec,type) ) ENGINE=MyISAM""" % (reindex_prefix, index_id)) run_sql("""DROP TABLE IF EXISTS %sidxPHRASE%02dF""" % (wash_table_column_name(reindex_prefix), index_id)) # kwalitee: disable=sql run_sql("""CREATE TABLE %sidxPHRASE%02dF ( id mediumint(9) unsigned NOT NULL auto_increment, term text default NULL, hitlist longblob, PRIMARY KEY (id), KEY term (term(50)) ) ENGINE=MyISAM""" % (reindex_prefix, index_id)) run_sql("""DROP TABLE IF EXISTS %sidxPHRASE%02dR""" % (wash_table_column_name(reindex_prefix), index_id)) # kwalitee: disable=sql run_sql("""CREATE TABLE %sidxPHRASE%02dR ( id_bibrec mediumint(9) unsigned NOT NULL default '0', termlist longblob, type enum('CURRENT','FUTURE','TEMPORARY') NOT NULL 
default 'CURRENT', PRIMARY KEY (id_bibrec,type) ) ENGINE=MyISAM""" % (reindex_prefix, index_id)) run_sql("UPDATE idxINDEX SET last_updated='0000-00-00 00:00:00' WHERE id=%s", (index_id,)) def get_fuzzy_authors_from_phrase(phrase, stemming_language=None): """ Return list of fuzzy phrase-tokens suitable for storing into author phrase index. """ author_tokenizer = BibIndexFuzzyNameTokenizer() return author_tokenizer.tokenize(phrase) def get_exact_authors_from_phrase(phrase, stemming_language=None): """ Return list of exact phrase-tokens suitable for storing into exact author phrase index. """ author_tokenizer = BibIndexExactNameTokenizer() return author_tokenizer.tokenize(phrase) def get_author_family_name_words_from_phrase(phrase, stemming_language=None): """ Return list of words from author family names, not his/her first names. The phrase is assumed to be the full author name. This is useful for CFG_BIBINDEX_AUTHOR_WORD_INDEX_EXCLUDE_FIRST_NAMES. """ d_family_names = {} # first, treat everything before first comma as surname: if ',' in phrase: d_family_names[phrase.split(',', 1)[0]] = 1 # second, try fuzzy author tokenizer to find surname variants: for name in get_fuzzy_authors_from_phrase(phrase, stemming_language): if ',' in name: d_family_names[name.split(',', 1)[0]] = 1 # now extract words from these surnames: d_family_names_words = {} for family_name in d_family_names.keys(): for word in get_words_from_phrase(family_name, stemming_language): d_family_names_words[word] = 1 return d_family_names_words.keys() def get_words_from_phrase(phrase, stemming_language=None): """ Return a list of words extracted from phrase. """ words_tokenizer = BibIndexWordTokenizer(stemming_language) return words_tokenizer.tokenize(phrase) def get_phrases_from_phrase(phrase, stemming_language=None): """Return list of phrases found in PHRASE. Note that the phrase is split into groups depending on the alphanumeric characters and punctuation characters definition present in the config file. """ phrase_tokenizer = BibIndexPhraseTokenizer(stemming_language) return phrase_tokenizer.tokenize(phrase) def get_pairs_from_phrase(phrase, stemming_language=None): """ Return list of oairs extracted from phrase. """ pairs_tokenizer = BibIndexPairTokenizer(stemming_language) return pairs_tokenizer.tokenize(phrase) def remove_subfields(s): "Removes subfields from string, e.g. 'foo $$c bar' becomes 'foo bar'." return re_subfields.sub(' ', s) def get_index_id_from_index_name(index_name): """Returns the words/phrase index id for INDEXNAME. Returns empty string in case there is no words table for this index. Example: field='author', output=4.""" out = 0 query = """SELECT w.id FROM idxINDEX AS w WHERE w.name=%s LIMIT 1""" res = run_sql(query, (index_name,), 1) if res: out = res[0][0] return out def get_index_name_from_index_id(index_id): """Returns the words/phrase index name for INDEXID. Returns '' in case there is no words table for this indexid. Example: field=9, output='fulltext'.""" res = run_sql("SELECT name FROM idxINDEX WHERE id=%s", (index_id,)) if res: return res[0][0] return '' def get_index_tags(indexname): """Returns the list of tags that are indexed inside INDEXNAME. Returns empty list in case there are no tags indexed in this index. Note: uses get_field_tags() defined before. 
Example: field='author', output=['100__%', '700__%'].""" out = [] query = """SELECT f.code FROM idxINDEX AS w, idxINDEX_field AS wf, field AS f WHERE w.name=%s AND w.id=wf.id_idxINDEX AND f.id=wf.id_field""" res = run_sql(query, (indexname,)) for row in res: out.extend(get_field_tags(row[0])) return out def get_all_indexes(): """Returns the list of the names of all defined words indexes. Returns empty list in case no indexes are defined. Example: output=['global', 'author'].""" out = [] query = """SELECT name FROM idxINDEX""" res = run_sql(query) for row in res: out.append(row[0]) return out def split_ranges(parse_string): """Parse a string and return the list of ranges.""" recIDs = [] ranges = parse_string.split(",") for arange in ranges: tmp_recIDs = arange.split("-") if len(tmp_recIDs) == 1: recIDs.append([int(tmp_recIDs[0]), int(tmp_recIDs[0])]) else: if int(tmp_recIDs[0]) > int(tmp_recIDs[1]): # sanity check tmp = tmp_recIDs[0] tmp_recIDs[0] = tmp_recIDs[1] tmp_recIDs[1] = tmp recIDs.append([int(tmp_recIDs[0]), int(tmp_recIDs[1])]) return recIDs def get_word_tables(tables): """ Given a list of table names it returns a list of tuples (index_id, index_name, index_tags). If tables is empty it returns the whole list.""" wordTables = [] if tables: indexes = tables.split(",") for index in indexes: index_id = get_index_id_from_index_name(index) if index_id: wordTables.append((index_id, index, get_index_tags(index))) else: write_message("Error: There is no %s words table." % index, sys.stderr) else: for index in get_all_indexes(): index_id = get_index_id_from_index_name(index) wordTables.append((index_id, index, get_index_tags(index))) return wordTables def get_date_range(var): "Returns the two dates contained as a low,high tuple" limits = var.split(",") if len(limits) == 1: low = get_datetime(limits[0]) return low, None if len(limits) == 2: low = get_datetime(limits[0]) high = get_datetime(limits[1]) return low, high return None, None def create_range_list(res): """Creates a range list from a recID select query result contained in res. The result is expected to have ascending numerical order.""" if not res: return [] row = res[0] if not row: return [] else: range_list = [[row, row]] for row in res[1:]: row_id = row if row_id == range_list[-1][1] + 1: range_list[-1][1] = row_id else: range_list.append([row_id, row_id]) return range_list def beautify_range_list(range_list): """Returns a non overlapping, maximal range list""" ret_list = [] for new in range_list: found = 0 for old in ret_list: if new[0] <= old[0] <= new[1] + 1 or new[0] - 1 <= old[1] <= new[1]: old[0] = min(old[0], new[0]) old[1] = max(old[1], new[1]) found = 1 break if not found: ret_list.append(new) return ret_list def truncate_index_table(index_name): """Properly truncate the given index.""" index_id = get_index_id_from_index_name(index_name) if index_id: write_message('Truncating %s index table in order to reindex.' % index_name, verbose=2) run_sql("UPDATE idxINDEX SET last_updated='0000-00-00 00:00:00' WHERE id=%s", (index_id,)) run_sql("TRUNCATE idxWORD%02dF" % index_id) # kwalitee: disable=sql run_sql("TRUNCATE idxWORD%02dR" % index_id) # kwalitee: disable=sql run_sql("TRUNCATE idxPHRASE%02dF" % index_id) # kwalitee: disable=sql run_sql("TRUNCATE idxPHRASE%02dR" % index_id) # kwalitee: disable=sql def update_index_last_updated(index_id, starting_time=None): """Update last_updated column of the index table in the database.
Puts starting time there so that if the task was interrupted for record download, the records will be reindexed next time.""" if starting_time is None: return None write_message("updating last_updated to %s..." % starting_time, verbose=9) return run_sql("UPDATE idxINDEX SET last_updated=%s WHERE id=%s", (starting_time, index_id,)) +def get_percentage_completed(num_done, num_total): + """ Return a string containing the approx. percentage completed """ + percentage_completed = 100.0 * float(num_done) / float(num_total) + if percentage_completed: + percentage_display = "(%.1f%%)" % (percentage_completed,) + else: + percentage_display = "" + return percentage_display + #def update_text_extraction_date(first_recid, last_recid): #"""for all the bibdoc connected to the specified recid, set #the text_extraction_date to the task_starting_time.""" #run_sql("UPDATE bibdoc JOIN bibrec_bibdoc ON id=id_bibdoc SET text_extraction_date=%s WHERE id_bibrec BETWEEN %s AND %s", (task_get_task_param('task_starting_time'), first_recid, last_recid)) class WordTable: "A class to hold the words table." def __init__(self, index_name, index_id, fields_to_index, table_name_pattern, default_get_words_fnc, tag_to_words_fnc_map, wash_index_terms=50, is_fulltext_index=False): """Creates words table instance. @param index_name: the index name @param index_id: the index integer identifier @param fields_to_index: a list of fields to index @param table_name_pattern: i.e. idxWORD%02dF or idxPHRASE%02dF @param default_get_words_fnc: the default function called to extract words from a metadata @param tag_to_words_fnc_map: a mapping to specify particular function to extract words from particular metadata (such as 8564_u) @param wash_index_terms: do we wash index terms, and if yes (when >0), how many characters do we keep in the index terms; see max_char_length parameter of wash_index_term() """ self.index_name = index_name self.index_id = index_id self.tablename = table_name_pattern % index_id + self.humanname = get_def_name('%s' % (str(index_id),), "idxINDEX")[0][1] self.recIDs_in_mem = [] self.fields_to_index = fields_to_index self.value = {} self.stemming_language = get_index_stemming_language(index_id) self.is_fulltext_index = is_fulltext_index self.wash_index_terms = wash_index_terms # tagToFunctions mapping. It offers an indirection level necessary for # indexing fulltext. The default is get_words_from_phrase self.tag_to_words_fnc_map = tag_to_words_fnc_map self.default_get_words_fnc = default_get_words_fnc if self.stemming_language and self.tablename.startswith('idxWORD'): write_message('%s has stemming enabled, language %s' % (self.tablename, self.stemming_language)) def get_field(self, recID, tag): """Returns list of values of the MARC-21 'tag' fields for the record 'recID'.""" out = [] bibXXx = "bib" + tag[0] + tag[1] + "x" bibrec_bibXXx = "bibrec_" + bibXXx query = """SELECT value FROM %s AS b, %s AS bb WHERE bb.id_bibrec=%%s AND bb.id_bibxxx=b.id AND tag LIKE %%s""" % (bibXXx, bibrec_bibXXx) res = run_sql(query, (recID, tag)) for row in res: out.append(row[0]) return out def clean(self): "Cleans the words table." self.value = {} def put_into_db(self, mode="normal"): """Updates the current words table in the corresponding DB idxFOO table. Mode 'normal' means normal execution, mode 'emergency' means words index reverting to old state.
""" write_message("%s %s wordtable flush started" % (self.tablename, mode)) write_message('...updating %d words into %s started' % \ (len(self.value), self.tablename)) - task_update_progress("%s flushed %d/%d words" % (self.tablename, 0, len(self.value))) + task_update_progress("(%s:%s) flushed %d/%d words" % (self.tablename, self.humanname, 0, len(self.value))) self.recIDs_in_mem = beautify_range_list(self.recIDs_in_mem) if mode == "normal": for group in self.recIDs_in_mem: query = """UPDATE %sR SET type='TEMPORARY' WHERE id_bibrec BETWEEN %%s AND %%s AND type='CURRENT'""" % self.tablename[:-1] write_message(query % (group[0], group[1]), verbose=9) run_sql(query, (group[0], group[1])) nb_words_total = len(self.value) nb_words_report = int(nb_words_total / 10.0) nb_words_done = 0 for word in self.value.keys(): self.put_word_into_db(word) nb_words_done += 1 if nb_words_report != 0 and ((nb_words_done % nb_words_report) == 0): write_message('......processed %d/%d words' % (nb_words_done, nb_words_total)) - task_update_progress("%s flushed %d/%d words" % (self.tablename, nb_words_done, nb_words_total)) + percentage_display = get_percentage_completed(nb_words_done, nb_words_total) + task_update_progress("(%s:%s) flushed %d/%d words %s" % (self.tablename, self.humanname, nb_words_done, nb_words_total, percentage_display)) write_message('...updating %d words into %s ended' % \ (nb_words_total, self.tablename)) write_message('...updating reverse table %sR started' % self.tablename[:-1]) if mode == "normal": for group in self.recIDs_in_mem: query = """UPDATE %sR SET type='CURRENT' WHERE id_bibrec BETWEEN %%s AND %%s AND type='FUTURE'""" % self.tablename[:-1] write_message(query % (group[0], group[1]), verbose=9) run_sql(query, (group[0], group[1])) query = """DELETE FROM %sR WHERE id_bibrec BETWEEN %%s AND %%s AND type='TEMPORARY'""" % self.tablename[:-1] write_message(query % (group[0], group[1]), verbose=9) run_sql(query, (group[0], group[1])) #if self.is_fulltext_index: #update_text_extraction_date(group[0], group[1]) write_message('End of updating wordTable into %s' % self.tablename, verbose=9) elif mode == "emergency": for group in self.recIDs_in_mem: query = """UPDATE %sR SET type='CURRENT' WHERE id_bibrec BETWEEN %%s AND %%s AND type='TEMPORARY'""" % self.tablename[:-1] write_message(query % (group[0], group[1]), verbose=9) run_sql(query, (group[0], group[1])) query = """DELETE FROM %sR WHERE id_bibrec BETWEEN %%s AND %%s AND type='FUTURE'""" % self.tablename[:-1] write_message(query % (group[0], group[1]), verbose=9) run_sql(query, (group[0], group[1])) write_message('End of emergency flushing wordTable into %s' % self.tablename, verbose=9) write_message('...updating reverse table %sR ended' % self.tablename[:-1]) self.clean() self.recIDs_in_mem = [] write_message("%s %s wordtable flush ended" % (self.tablename, mode)) - task_update_progress("%s flush ended" % (self.tablename)) + task_update_progress("(%s:%s) flush ended" % (self.tablename, self.humanname)) def load_old_recIDs(self, word): """Load existing hitlist for the word from the database index files.""" query = "SELECT hitlist FROM %s WHERE term=%%s" % self.tablename res = run_sql(query, (word,)) if res: return intbitset(res[0][0]) else: return None def merge_with_old_recIDs(self, word, set): """Merge the system numbers stored in memory (hash of recIDs with value +1 or -1 according to whether to add/delete them) with those stored in the database index and received in set universe of recIDs for the given word. 
Return False in case no change was done to SET, return True in case SET was changed. """ oldset = intbitset(set) set.update_with_signs(self.value[word]) return set != oldset def put_word_into_db(self, word): """Flush a single word to the database and delete it from memory""" set = self.load_old_recIDs(word) if set is not None: # merge the word recIDs found in memory: if not self.merge_with_old_recIDs(word, set): # nothing to update: write_message("......... unchanged hitlist for ``%s''" % word, verbose=9) pass else: # yes there were some new words: write_message("......... updating hitlist for ``%s''" % word, verbose=9) run_sql("UPDATE %s SET hitlist=%%s WHERE term=%%s" % wash_table_column_name(self.tablename), (set.fastdump(), word)) # kwalitee: disable=sql else: # the word is new, will create new set: write_message("......... inserting hitlist for ``%s''" % word, verbose=9) set = intbitset(self.value[word].keys()) try: run_sql("INSERT INTO %s (term, hitlist) VALUES (%%s, %%s)" % wash_table_column_name(self.tablename), (word, set.fastdump())) # kwalitee: disable=sql except Exception, e: ## We send this exception to the admin only when it is not ## already repairing the problem. register_exception(prefix="Error when putting the term '%s' into db (hitlist=%s): %s\n" % (repr(word), set, e), alert_admin=(task_get_option('cmd') != 'repair')) if not set: # never store empty words run_sql("DELETE FROM %s WHERE term=%%s" % wash_table_column_name(self.tablename), (word,)) # kwalitee: disable=sql del self.value[word] def display(self): "Displays the word table." keys = self.value.keys() keys.sort() for k in keys: write_message("%s: %s" % (k, self.value[k])) def count(self): "Returns the number of words in the table." return len(self.value) def info(self): "Prints some information on the words table." write_message("The words table contains %d words." % self.count()) def lookup_words(self, word=""): "Lookup word from the words table." if not word: done = 0 while not done: try: word = raw_input("Enter word: ") done = 1 except (EOFError, KeyboardInterrupt): return if self.value.has_key(word): write_message("The word '%s' is found %d times." \ % (word, len(self.value[word]))) else: write_message("The word '%s' does not exist in the word file."\ % word) def add_recIDs(self, recIDs, opt_flush): """Fetches records whose id is in the recIDs range list and adds them to the wordTable. The recIDs range list is of the form: [[i1_low,i1_high],[i2_low,i2_high], ..., [iN_low,iN_high]].
""" global chunksize, _last_word_table flush_count = 0 records_done = 0 records_to_go = 0 for arange in recIDs: records_to_go = records_to_go + arange[1] - arange[0] + 1 time_started = time.time() # will measure profile time for arange in recIDs: i_low = arange[0] chunksize_count = 0 while i_low <= arange[1]: task_sleep_now_if_required() # calculate chunk group of recIDs and treat it: i_high = min(i_low + opt_flush - flush_count - 1, arange[1]) i_high = min(i_low + chunksize - chunksize_count - 1, i_high) try: self.chk_recID_range(i_low, i_high) except StandardError: if self.index_name == 'fulltext' and CFG_SOLR_URL: solr_commit() raise write_message("%s adding records #%d-#%d started" % \ (self.tablename, i_low, i_high)) if CFG_CHECK_MYSQL_THREADS: kill_sleepy_mysql_threads() - task_update_progress("%s adding recs %d-%d" % (self.tablename, i_low, i_high)) + percentage_display = get_percentage_completed(records_done, records_to_go) + task_update_progress("(%s:%s) adding recs %d-%d %s" % (self.tablename, self.humanname, i_low, i_high, percentage_display)) self.del_recID_range(i_low, i_high) just_processed = self.add_recID_range(i_low, i_high) flush_count = flush_count + i_high - i_low + 1 chunksize_count = chunksize_count + i_high - i_low + 1 records_done = records_done + just_processed write_message("%s adding records #%d-#%d ended " % \ (self.tablename, i_low, i_high)) if chunksize_count >= chunksize: chunksize_count = 0 # flush if necessary: if flush_count >= opt_flush: self.put_into_db() self.clean() if self.index_name == 'fulltext' and CFG_SOLR_URL: solr_commit() write_message("%s backing up" % (self.tablename)) flush_count = 0 self.log_progress(time_started, records_done, records_to_go) # iterate: i_low = i_high + 1 if flush_count > 0: self.put_into_db() if self.index_name == 'fulltext' and CFG_SOLR_URL: solr_commit() self.log_progress(time_started, records_done, records_to_go) def add_recIDs_by_date(self, dates, opt_flush): """Add records that were modified between DATES[0] and DATES[1]. If DATES is not set, then add records that were modified since the last update of the index. 
""" if not dates: table_id = self.tablename[-3:-1] query = """SELECT last_updated FROM idxINDEX WHERE id=%s""" res = run_sql(query, (table_id,)) if not res: return if not res[0][0]: dates = ("0000-00-00", None) else: dates = (res[0][0], None) if dates[1] is None: res = intbitset(run_sql("""SELECT b.id FROM bibrec AS b WHERE b.modification_date >= %s""", (dates[0],))) if self.is_fulltext_index: res |= intbitset(run_sql("""SELECT id_bibrec FROM bibrec_bibdoc JOIN bibdoc ON id_bibdoc=id WHERE text_extraction_date <= modification_date AND modification_date >= %s AND status<>'DELETED'""", (dates[0],))) elif dates[0] is None: res = intbitset(run_sql("""SELECT b.id FROM bibrec AS b WHERE b.modification_date <= %s""", (dates[1],))) if self.is_fulltext_index: res |= intbitset(run_sql("""SELECT id_bibrec FROM bibrec_bibdoc JOIN bibdoc ON id_bibdoc=id WHERE text_extraction_date <= modification_date AND modification_date <= %s AND status<>'DELETED'""", (dates[1],))) else: res = intbitset(run_sql("""SELECT b.id FROM bibrec AS b WHERE b.modification_date >= %s AND b.modification_date <= %s""", (dates[0], dates[1]))) if self.is_fulltext_index: res |= intbitset(run_sql("""SELECT id_bibrec FROM bibrec_bibdoc JOIN bibdoc ON id_bibdoc=id WHERE text_extraction_date <= modification_date AND modification_date >= %s AND modification_date <= %s AND status<>'DELETED'""", (dates[0], dates[1],))) alist = create_range_list(list(res)) if not alist: write_message("No new records added. %s is up to date" % self.tablename) else: self.add_recIDs(alist, opt_flush) # special case of author indexes where we need to re-index # those records that were affected by changed BibAuthorID # attributions: if self.index_name in ('author', 'firstauthor', 'exactauthor', 'exactfirstauthor'): from invenio.bibauthorid_personid_maintenance import get_recids_affected_since # dates[1] is ignored, since BibAuthorID API does not offer upper limit search alist = create_range_list(get_recids_affected_since(dates[0])) if not alist: write_message("No new records added by author canonical IDs. 
%s is up to date" % self.tablename) else: self.add_recIDs(alist, opt_flush) def add_recID_range(self, recID1, recID2): """Add records from RECID1 to RECID2.""" wlist = {} self.recIDs_in_mem.append([recID1, recID2]) # special case of author indexes where we also add author # canonical IDs: if self.index_name in ('author', 'firstauthor', 'exactauthor', 'exactfirstauthor'): for recID in range(recID1, recID2 + 1): if not wlist.has_key(recID): wlist[recID] = [] wlist[recID] = list_union(get_author_canonical_ids_for_recid(recID), wlist[recID]) # special case of journal index: if self.fields_to_index == [CFG_JOURNAL_TAG]: # FIXME: quick hack for the journal index; a special # treatment where we need to associate more than one # subfield into indexed term for recID in range(recID1, recID2 + 1): new_words = get_words_from_journal_tag(recID, self.fields_to_index[0]) if not wlist.has_key(recID): wlist[recID] = [] wlist[recID] = list_union(new_words, wlist[recID]) elif self.index_name in ('authorcount',): # FIXME: quick hack for the authorcount index; we have to # count the number of author fields only for recID in range(recID1, recID2 + 1): new_words = [str(get_field_count(recID, self.fields_to_index)),] if not wlist.has_key(recID): wlist[recID] = [] wlist[recID] = list_union(new_words, wlist[recID]) else: # usual tag-by-tag indexing: for tag in self.fields_to_index: get_words_function = self.tag_to_words_fnc_map.get(tag, self.default_get_words_fnc) bibXXx = "bib" + tag[0] + tag[1] + "x" bibrec_bibXXx = "bibrec_" + bibXXx query = """SELECT bb.id_bibrec,b.value FROM %s AS b, %s AS bb WHERE bb.id_bibrec BETWEEN %%s AND %%s AND bb.id_bibxxx=b.id AND tag LIKE %%s""" % (bibXXx, bibrec_bibXXx) res = run_sql(query, (recID1, recID2, tag)) if tag == '8564_u': ## FIXME: Quick hack to be sure that hidden files are ## actually indexed. res = set(res) for recid in xrange(int(recID1), int(recID2) + 1): for bibdocfile in BibRecDocs(recid).list_latest_files(): res.add((recid, bibdocfile.get_url())) for row in sorted(res): recID, phrase = row if not wlist.has_key(recID): wlist[recID] = [] new_words = get_words_function(phrase, stemming_language=self.stemming_language) # ,self.separators wlist[recID] = list_union(new_words, wlist[recID]) # lookup index-time synonyms: if CFG_BIBINDEX_SYNONYM_KBRS.has_key(self.index_name): if len(wlist) == 0: return 0 recIDs = wlist.keys() for recID in recIDs: for word in wlist[recID]: word_synonyms = get_synonym_terms(word, CFG_BIBINDEX_SYNONYM_KBRS[self.index_name][0], CFG_BIBINDEX_SYNONYM_KBRS[self.index_name][1]) if word_synonyms: wlist[recID] = list_union(word_synonyms, wlist[recID]) # were there some words for these recIDs found? if len(wlist) == 0: return 0 recIDs = wlist.keys() for recID in recIDs: # was this record marked as deleted? if "DELETED" in self.get_field(recID, "980__c"): wlist[recID] = [] write_message("... record %d was declared deleted, removing its word list" % recID, verbose=9) write_message("... record %d, termlist: %s" % (recID, wlist[recID]), verbose=9) # put words into reverse index table with FUTURE status: for recID in recIDs: run_sql("INSERT INTO %sR (id_bibrec,termlist,type) VALUES (%%s,%%s,'FUTURE')" % wash_table_column_name(self.tablename[:-1]), (recID, serialize_via_marshal(wlist[recID]))) # kwalitee: disable=sql # ... 
and, for new records, enter the CURRENT status as empty: try: run_sql("INSERT INTO %sR (id_bibrec,termlist,type) VALUES (%%s,%%s,'CURRENT')" % wash_table_column_name(self.tablename[:-1]), (recID, serialize_via_marshal([]))) # kwalitee: disable=sql except DatabaseError: # okay, it's an already existing record, no problem pass # put words into memory word list: put = self.put for recID in recIDs: for w in wlist[recID]: put(recID, w, 1) return len(recIDs) def log_progress(self, start, done, todo): """Calculate progress and store it. start: start time, done: records processed, todo: total number of records""" time_elapsed = time.time() - start # consistency check if time_elapsed == 0 or done > todo: return time_recs_per_min = done / (time_elapsed / 60.0) write_message("%d records took %.1f seconds to complete. (%.1f recs/min)"\ % (done, time_elapsed, time_recs_per_min)) if time_recs_per_min: write_message("Estimated runtime: %.1f minutes" % \ ((todo - done) / time_recs_per_min)) def put(self, recID, word, sign): """Adds or deletes a word in the word list.""" try: if self.wash_index_terms: word = wash_index_term(word, self.wash_index_terms) if self.value.has_key(word): # the word 'word' exists already: update sign self.value[word][recID] = sign else: self.value[word] = {recID: sign} except: write_message("Error: Cannot put word %s with sign %d for recID %s." % (word, sign, recID)) def del_recIDs(self, recIDs): """Fetches records whose id is in the recIDs range list and adds them to the wordTable. The recIDs range list is of the form: [[i1_low,i1_high],[i2_low,i2_high], ..., [iN_low,iN_high]]. """ count = 0 for arange in recIDs: task_sleep_now_if_required() self.del_recID_range(arange[0], arange[1]) count = count + arange[1] - arange[0] self.put_into_db() if self.index_name == 'fulltext' and CFG_SOLR_URL: solr_commit() def del_recID_range(self, low, high): """Deletes records with 'recID' system number between low and high from the memory words index table.""" write_message("%s fetching existing words for records #%d-#%d started" % \ (self.tablename, low, high), verbose=3) self.recIDs_in_mem.append([low, high]) query = """SELECT id_bibrec,termlist FROM %sR as bb WHERE bb.id_bibrec BETWEEN %%s AND %%s""" % (self.tablename[:-1]) recID_rows = run_sql(query, (low, high)) for recID_row in recID_rows: recID = recID_row[0] wlist = deserialize_via_marshal(recID_row[1]) for word in wlist: self.put(recID, word, -1) write_message("%s fetching existing words for records #%d-#%d ended" % \ (self.tablename, low, high), verbose=3) def report_on_table_consistency(self): """Check reverse words index tables (e.g. idxWORD01R) for interesting states such as 'TEMPORARY' state. Prints a small report (number of words, number of bad words).
""" # find number of words: query = """SELECT COUNT(*) FROM %s""" % (self.tablename) res = run_sql(query, None, 1) if res: nb_words = res[0][0] else: nb_words = 0 # find number of records: query = """SELECT COUNT(DISTINCT(id_bibrec)) FROM %sR""" % (self.tablename[:-1]) res = run_sql(query, None, 1) if res: nb_records = res[0][0] else: nb_records = 0 # report stats: write_message("%s contains %d words from %d records" % (self.tablename, nb_words, nb_records)) # find possible bad states in reverse tables: query = """SELECT COUNT(DISTINCT(id_bibrec)) FROM %sR WHERE type <> 'CURRENT'""" % (self.tablename[:-1]) res = run_sql(query) if res: nb_bad_records = res[0][0] else: nb_bad_records = 999999999 if nb_bad_records: write_message("EMERGENCY: %s needs to repair %d of %d index records" % \ (self.tablename, nb_bad_records, nb_records)) else: write_message("%s is in consistent state" % (self.tablename)) return nb_bad_records def repair(self, opt_flush): """Repair the whole table""" # find possible bad states in reverse tables: query = """SELECT COUNT(DISTINCT(id_bibrec)) FROM %sR WHERE type <> 'CURRENT'""" % (self.tablename[:-1]) res = run_sql(query, None, 1) if res: nb_bad_records = res[0][0] else: nb_bad_records = 0 if nb_bad_records == 0: return query = """SELECT id_bibrec FROM %sR WHERE type <> 'CURRENT'""" \ % (self.tablename[:-1]) res = intbitset(run_sql(query)) recIDs = create_range_list(list(res)) flush_count = 0 records_done = 0 records_to_go = 0 for arange in recIDs: records_to_go = records_to_go + arange[1] - arange[0] + 1 time_started = time.time() # will measure profile time for arange in recIDs: i_low = arange[0] chunksize_count = 0 while i_low <= arange[1]: task_sleep_now_if_required() # calculate chunk group of recIDs and treat it: i_high = min(i_low + opt_flush - flush_count - 1, arange[1]) i_high = min(i_low + chunksize - chunksize_count - 1, i_high) self.fix_recID_range(i_low, i_high) flush_count = flush_count + i_high - i_low + 1 chunksize_count = chunksize_count + i_high - i_low + 1 records_done = records_done + i_high - i_low + 1 if chunksize_count >= chunksize: chunksize_count = 0 # flush if necessary: if flush_count >= opt_flush: self.put_into_db("emergency") self.clean() flush_count = 0 self.log_progress(time_started, records_done, records_to_go) # iterate: i_low = i_high + 1 if flush_count > 0: self.put_into_db("emergency") self.log_progress(time_started, records_done, records_to_go) write_message("%s inconsistencies repaired." % self.tablename) def chk_recID_range(self, low, high): """Check if the reverse index table is in proper state""" ## check db query = """SELECT COUNT(*) FROM %sR WHERE type <> 'CURRENT' AND id_bibrec BETWEEN %%s AND %%s""" % self.tablename[:-1] res = run_sql(query, (low, high), 1) if res[0][0] == 0: write_message("%s for %d-%d is in consistent state" % (self.tablename, low, high)) return # okay, words table is consistent ## inconsistency detected! write_message("EMERGENCY: %s inconsistencies detected..." % self.tablename) error_message = "Errors found. You should check consistency of the " \ "%s - %sR tables.\nRunning 'bibindex --repair' is " \ "recommended." % (self.tablename, self.tablename[:-1]) write_message("EMERGENCY: " + error_message, stream=sys.stderr) raise StandardError(error_message) def fix_recID_range(self, low, high): """Try to fix reverse index database consistency (e.g. table idxWORD01R) in the low,high doc-id range. Possible states for a recID follow: CUR TMP FUT: very bad things have happened: warn! 
CUR TMP : very bad things have happened: warn! CUR FUT: delete FUT (crash before flushing) CUR : database is ok TMP FUT: add TMP to memory and del FUT from memory flush (revert to old state) TMP : very bad things have happened: warn! FUT: very bad things have happened: warn! """ state = {} query = "SELECT id_bibrec,type FROM %sR WHERE id_bibrec BETWEEN %%s AND %%s"\ % self.tablename[:-1] res = run_sql(query, (low, high)) for row in res: if not state.has_key(row[0]): state[row[0]] = [] state[row[0]].append(row[1]) ok = 1 # will hold info on whether we will be able to repair for recID in state.keys(): if not 'TEMPORARY' in state[recID]: if 'FUTURE' in state[recID]: if 'CURRENT' not in state[recID]: write_message("EMERGENCY: Index record %d is in an inconsistent state. Can't repair it." % recID) ok = 0 else: write_message("EMERGENCY: Inconsistency in index record %d detected" % recID) query = """DELETE FROM %sR WHERE id_bibrec=%%s""" % self.tablename[:-1] run_sql(query, (recID,)) write_message("EMERGENCY: Inconsistency in record %d repaired." % recID) else: if 'FUTURE' in state[recID] and not 'CURRENT' in state[recID]: self.recIDs_in_mem.append([recID, recID]) # Get the words file query = """SELECT type,termlist FROM %sR WHERE id_bibrec=%%s""" % self.tablename[:-1] write_message(query, verbose=9) res = run_sql(query, (recID,)) for row in res: wlist = deserialize_via_marshal(row[1]) write_message("Words are %s " % wlist, verbose=9) if row[0] == 'TEMPORARY': sign = 1 else: sign = -1 for word in wlist: self.put(recID, word, sign) else: write_message("EMERGENCY: %s for %d is in an inconsistent " "state. Couldn't repair it." % (self.tablename, recID), stream=sys.stderr) ok = 0 if not ok: error_message = "Unrepairable errors found. You should check " \ "consistency of the %s - %sR tables. Deleting affected " \ "TEMPORARY and FUTURE entries from these tables is " \ "recommended; see the BibIndex Admin Guide."
% \ (self.tablename, self.tablename[:-1]) write_message("EMERGENCY: " + error_message, stream=sys.stderr) raise StandardError(error_message) def main(): """Main function that constructs the bibtask.""" task_init(authorization_action='runbibindex', authorization_msg="BibIndex Task Submission", description="""Examples: \t%s -a -i 234-250,293,300-500 -u admin@localhost \t%s -a -w author,fulltext -M 8192 -v3 \t%s -d -m +4d -A on --flush=10000\n""" % ((sys.argv[0],) * 3), help_specific_usage=""" Indexing options: -a, --add\t\tadd or update words for selected records -d, --del\t\tdelete words for selected records -i, --id=low[-high]\t\tselect according to doc recID -m, --modified=from[,to]\tselect according to modification date -c, --collection=c1[,c2]\tselect according to collection -R, --reindex\treindex the selected indexes from scratch Repairing options: -k, --check\t\tcheck consistency for all records in the table(s) -r, --repair\t\ttry to repair all records in the table(s) Specific options: -w, --windex=w1[,w2]\tword/phrase indexes to consider (all) -M, --maxmem=XXX\tmaximum memory usage in kB (no limit) -f, --flush=NNN\t\tfull consistent table flush after NNN records (10000) """, version=__revision__, specific_params=("adi:m:c:w:krRM:f:", [ "add", "del", "id=", "modified=", "collection=", "windex=", "check", "repair", "reindex", "maxmem=", "flush=", ]), task_stop_helper_fnc=task_stop_table_close_fnc, task_submit_elaborate_specific_parameter_fnc=task_submit_elaborate_specific_parameter, task_run_fnc=task_run_core, task_submit_check_options_fnc=task_submit_check_options) def task_submit_check_options(): """Check for options compatibility.""" if task_get_option("reindex"): if task_get_option("cmd") != "add" or task_get_option('id') or task_get_option('collection'): print >> sys.stderr, "ERROR: You can use --reindex only when adding modified records." return False return True def task_submit_elaborate_specific_parameter(key, value, opts, args): """ Given the string key, it checks its meaning, possibly using the value. Usually it fills some key in the options dict. It must return True if it has elaborated the key, False if it doesn't know that key. eg: if key in ['-n', '--number']: self.options['number'] = value return True return False """ if key in ("-a", "--add"): task_set_option("cmd", "add") if ("-d", "") in opts or ("--del", "") in opts: raise StandardError("Cannot have --add and --del at the same time!") elif key in ("-k", "--check"): task_set_option("cmd", "check") elif key in ("-r", "--repair"): task_set_option("cmd", "repair") elif key in ("-d", "--del"): task_set_option("cmd", "del") elif key in ("-i", "--id"): task_set_option('id', task_get_option('id') + split_ranges(value)) elif key in ("-m", "--modified"): task_set_option("modified", get_date_range(value)) elif key in ("-c", "--collection"): task_set_option("collection", value) elif key in ("-R", "--reindex"): task_set_option("reindex", True) elif key in ("-w", "--windex"): task_set_option("windex", value) elif key in ("-M", "--maxmem"): task_set_option("maxmem", int(value)) if task_get_option("maxmem") < base_process_size + 1000: raise StandardError("Memory usage should be higher than %d kB" % \ (base_process_size + 1000)) elif key in ("-f", "--flush"): task_set_option("flush", int(value)) else: return False return True def task_stop_table_close_fnc(): """ Close tables when the task is asked to STOP.
""" global _last_word_table if _last_word_table: _last_word_table.put_into_db() def task_run_core(): """Runs the task by fetching arguments from the BibSched task queue. This is what BibSched will be invoking via daemon call. The task prints Fibonacci numbers for up to NUM on the stdout, and some messages on stderr. Return 1 in case of success and 0 in case of failure.""" global _last_word_table if task_get_option("cmd") == "check": wordTables = get_word_tables(task_get_option("windex")) for index_id, index_name, index_tags in wordTables: if index_name == 'year' and CFG_INSPIRE_SITE: fnc_get_words_from_phrase = get_words_from_date_tag elif index_name in ('author', 'firstauthor') and \ CFG_BIBINDEX_AUTHOR_WORD_INDEX_EXCLUDE_FIRST_NAMES: fnc_get_words_from_phrase = get_author_family_name_words_from_phrase else: fnc_get_words_from_phrase = get_words_from_phrase wordTable = WordTable(index_name=index_name, index_id=index_id, fields_to_index=index_tags, table_name_pattern='idxWORD%02dF', default_get_words_fnc=fnc_get_words_from_phrase, tag_to_words_fnc_map={'8564_u': get_words_from_fulltext}, wash_index_terms=50) _last_word_table = wordTable wordTable.report_on_table_consistency() task_sleep_now_if_required(can_stop_too=True) if index_name in ('author', 'firstauthor') and \ CFG_BIBINDEX_AUTHOR_WORD_INDEX_EXCLUDE_FIRST_NAMES: fnc_get_pairs_from_phrase = get_pairs_from_phrase # FIXME else: fnc_get_pairs_from_phrase = get_pairs_from_phrase wordTable = WordTable(index_name=index_name, index_id=index_id, fields_to_index=index_tags, table_name_pattern='idxPAIR%02dF', default_get_words_fnc=fnc_get_pairs_from_phrase, tag_to_words_fnc_map={'8564_u': get_nothing_from_phrase}, wash_index_terms=100) _last_word_table = wordTable wordTable.report_on_table_consistency() task_sleep_now_if_required(can_stop_too=True) if index_name in ('author', 'firstauthor'): fnc_get_phrases_from_phrase = get_fuzzy_authors_from_phrase elif index_name in ('exactauthor', 'exactfirstauthor'): fnc_get_phrases_from_phrase = get_exact_authors_from_phrase else: fnc_get_phrases_from_phrase = get_phrases_from_phrase wordTable = WordTable(index_name=index_name, index_id=index_id, fields_to_index=index_tags, table_name_pattern='idxPHRASE%02dF', default_get_words_fnc=fnc_get_phrases_from_phrase, tag_to_words_fnc_map={'8564_u': get_nothing_from_phrase}, wash_index_terms=0) _last_word_table = wordTable wordTable.report_on_table_consistency() task_sleep_now_if_required(can_stop_too=True) _last_word_table = None return True # Let's work on single words! 
wordTables = get_word_tables(task_get_option("windex")) for index_id, index_name, index_tags in wordTables: is_fulltext_index = index_name == 'fulltext' reindex_prefix = "" if task_get_option("reindex"): reindex_prefix = "tmp_" init_temporary_reindex_tables(index_id, reindex_prefix) if index_name == 'year' and CFG_INSPIRE_SITE: fnc_get_words_from_phrase = get_words_from_date_tag elif index_name in ('author', 'firstauthor') and \ CFG_BIBINDEX_AUTHOR_WORD_INDEX_EXCLUDE_FIRST_NAMES: fnc_get_words_from_phrase = get_author_family_name_words_from_phrase else: fnc_get_words_from_phrase = get_words_from_phrase wordTable = WordTable(index_name=index_name, index_id=index_id, fields_to_index=index_tags, table_name_pattern=reindex_prefix + 'idxWORD%02dF', default_get_words_fnc=fnc_get_words_from_phrase, tag_to_words_fnc_map={'8564_u': get_words_from_fulltext}, is_fulltext_index=is_fulltext_index, wash_index_terms=50) _last_word_table = wordTable wordTable.report_on_table_consistency() try: if task_get_option("cmd") == "del": if task_get_option("id"): wordTable.del_recIDs(task_get_option("id")) task_sleep_now_if_required(can_stop_too=True) elif task_get_option("collection"): l_of_colls = task_get_option("collection").split(",") recIDs = perform_request_search(c=l_of_colls) recIDs_range = [] for recID in recIDs: recIDs_range.append([recID, recID]) wordTable.del_recIDs(recIDs_range) task_sleep_now_if_required(can_stop_too=True) else: error_message = "Missing IDs of records to delete from " \ "index %s." % wordTable.tablename write_message(error_message, stream=sys.stderr) raise StandardError(error_message) elif task_get_option("cmd") == "add": if task_get_option("id"): wordTable.add_recIDs(task_get_option("id"), task_get_option("flush")) task_sleep_now_if_required(can_stop_too=True) elif task_get_option("collection"): l_of_colls = task_get_option("collection").split(",") recIDs = perform_request_search(c=l_of_colls) recIDs_range = [] for recID in recIDs: recIDs_range.append([recID, recID]) wordTable.add_recIDs(recIDs_range, task_get_option("flush")) task_sleep_now_if_required(can_stop_too=True) else: wordTable.add_recIDs_by_date(task_get_option("modified"), task_get_option("flush")) ## here we used to update last_updated info, if run via automatic mode; ## but do not update here anymore, since idxPHRASE will be acted upon later task_sleep_now_if_required(can_stop_too=True) elif task_get_option("cmd") == "repair": wordTable.repair(task_get_option("flush")) task_sleep_now_if_required(can_stop_too=True) else: error_message = "Invalid command found processing %s" % \ wordTable.tablename write_message(error_message, stream=sys.stderr) raise StandardError(error_message) except StandardError, e: write_message("Exception caught: %s" % e, sys.stderr) register_exception(alert_admin=True) if _last_word_table: _last_word_table.put_into_db() raise wordTable.report_on_table_consistency() task_sleep_now_if_required(can_stop_too=True) # Let's work on pairs now if index_name in ('author', 'firstauthor') and \ CFG_BIBINDEX_AUTHOR_WORD_INDEX_EXCLUDE_FIRST_NAMES: fnc_get_pairs_from_phrase = get_pairs_from_phrase # FIXME else: fnc_get_pairs_from_phrase = get_pairs_from_phrase wordTable = WordTable(index_name=index_name, index_id=index_id, fields_to_index=index_tags, table_name_pattern=reindex_prefix + 'idxPAIR%02dF', default_get_words_fnc=fnc_get_pairs_from_phrase, tag_to_words_fnc_map={'8564_u': get_nothing_from_phrase}, wash_index_terms=100) _last_word_table = wordTable wordTable.report_on_table_consistency() try: if 
task_get_option("cmd") == "del": if task_get_option("id"): wordTable.del_recIDs(task_get_option("id")) task_sleep_now_if_required(can_stop_too=True) elif task_get_option("collection"): l_of_colls = task_get_option("collection").split(",") recIDs = perform_request_search(c=l_of_colls) recIDs_range = [] for recID in recIDs: recIDs_range.append([recID, recID]) wordTable.del_recIDs(recIDs_range) task_sleep_now_if_required(can_stop_too=True) else: error_message = "Missing IDs of records to delete from " \ "index %s." % wordTable.tablename write_message(error_message, stream=sys.stderr) raise StandardError(error_message) elif task_get_option("cmd") == "add": if task_get_option("id"): wordTable.add_recIDs(task_get_option("id"), task_get_option("flush")) task_sleep_now_if_required(can_stop_too=True) elif task_get_option("collection"): l_of_colls = task_get_option("collection").split(",") recIDs = perform_request_search(c=l_of_colls) recIDs_range = [] for recID in recIDs: recIDs_range.append([recID, recID]) wordTable.add_recIDs(recIDs_range, task_get_option("flush")) task_sleep_now_if_required(can_stop_too=True) else: wordTable.add_recIDs_by_date(task_get_option("modified"), task_get_option("flush")) # let us update last_updated timestamp info, if run via automatic mode: task_sleep_now_if_required(can_stop_too=True) elif task_get_option("cmd") == "repair": wordTable.repair(task_get_option("flush")) task_sleep_now_if_required(can_stop_too=True) else: error_message = "Invalid command found processing %s" % \ wordTable.tablename write_message(error_message, stream=sys.stderr) raise StandardError(error_message) except StandardError, e: write_message("Exception caught: %s" % e, sys.stderr) register_exception() if _last_word_table: _last_word_table.put_into_db() raise wordTable.report_on_table_consistency() task_sleep_now_if_required(can_stop_too=True) # Let's work on phrases now if index_name in ('author', 'firstauthor'): fnc_get_phrases_from_phrase = get_fuzzy_authors_from_phrase elif index_name in ('exactauthor', 'exactfirstauthor'): fnc_get_phrases_from_phrase = get_exact_authors_from_phrase else: fnc_get_phrases_from_phrase = get_phrases_from_phrase wordTable = WordTable(index_name=index_name, index_id=index_id, fields_to_index=index_tags, table_name_pattern=reindex_prefix + 'idxPHRASE%02dF', default_get_words_fnc=fnc_get_phrases_from_phrase, tag_to_words_fnc_map={'8564_u': get_nothing_from_phrase}, wash_index_terms=0) _last_word_table = wordTable wordTable.report_on_table_consistency() try: if task_get_option("cmd") == "del": if task_get_option("id"): wordTable.del_recIDs(task_get_option("id")) task_sleep_now_if_required(can_stop_too=True) elif task_get_option("collection"): l_of_colls = task_get_option("collection").split(",") recIDs = perform_request_search(c=l_of_colls) recIDs_range = [] for recID in recIDs: recIDs_range.append([recID, recID]) wordTable.del_recIDs(recIDs_range) task_sleep_now_if_required(can_stop_too=True) else: error_message = "Missing IDs of records to delete from " \ "index %s." 
% wordTable.tablename write_message(error_message, stream=sys.stderr) raise StandardError(error_message) elif task_get_option("cmd") == "add": if task_get_option("id"): wordTable.add_recIDs(task_get_option("id"), task_get_option("flush")) task_sleep_now_if_required(can_stop_too=True) elif task_get_option("collection"): l_of_colls = task_get_option("collection").split(",") recIDs = perform_request_search(c=l_of_colls) recIDs_range = [] for recID in recIDs: recIDs_range.append([recID, recID]) wordTable.add_recIDs(recIDs_range, task_get_option("flush")) task_sleep_now_if_required(can_stop_too=True) else: wordTable.add_recIDs_by_date(task_get_option("modified"), task_get_option("flush")) # let us update last_updated timestamp info, if run via automatic mode: update_index_last_updated(index_id, task_get_task_param('task_starting_time')) task_sleep_now_if_required(can_stop_too=True) elif task_get_option("cmd") == "repair": wordTable.repair(task_get_option("flush")) task_sleep_now_if_required(can_stop_too=True) else: error_message = "Invalid command found processing %s" % \ wordTable.tablename write_message(error_message, stream=sys.stderr) raise StandardError(error_message) except StandardError, e: write_message("Exception caught: %s" % e, sys.stderr) register_exception() if _last_word_table: _last_word_table.put_into_db() raise wordTable.report_on_table_consistency() task_sleep_now_if_required(can_stop_too=True) if task_get_option("reindex"): swap_temporary_reindex_tables(index_id, reindex_prefix) update_index_last_updated(index_id, task_get_task_param('task_starting_time')) task_sleep_now_if_required(can_stop_too=True) _last_word_table = None return True ### okay, here we go: if __name__ == '__main__': main() diff --git a/modules/bibsched/lib/bibsched.py b/modules/bibsched/lib/bibsched.py index 8b7260212..2912bfe59 100644 --- a/modules/bibsched/lib/bibsched.py +++ b/modules/bibsched/lib/bibsched.py @@ -1,1827 +1,1827 @@ # -*- coding: utf-8 -*- ## ## This file is part of Invenio. ## Copyright (C) 2006, 2007, 2008, 2009, 2010, 2011, 2012 CERN. ## ## Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. 
"""BibSched - task management, scheduling and executing system for Invenio """ __revision__ = "$Id$" import os import sys import time import re import marshal import getopt from itertools import chain from socket import gethostname from subprocess import Popen import signal from invenio.bibtask_config import \ CFG_BIBTASK_VALID_TASKS, \ CFG_BIBTASK_MONOTASKS, \ CFG_BIBTASK_FIXEDTIMETASKS from invenio.config import \ CFG_PREFIX, \ CFG_BIBSCHED_REFRESHTIME, \ CFG_BIBSCHED_LOG_PAGER, \ CFG_BIBSCHED_EDITOR, \ CFG_BINDIR, \ CFG_LOGDIR, \ CFG_BIBSCHED_GC_TASKS_OLDER_THAN, \ CFG_BIBSCHED_GC_TASKS_TO_REMOVE, \ CFG_BIBSCHED_GC_TASKS_TO_ARCHIVE, \ CFG_BIBSCHED_MAX_NUMBER_CONCURRENT_TASKS, \ CFG_SITE_URL, \ CFG_BIBSCHED_NODE_TASKS, \ CFG_BIBSCHED_MAX_ARCHIVED_ROWS_DISPLAY from invenio.dbquery import run_sql, real_escape_string from invenio.textutils import wrap_text_in_a_box from invenio.errorlib import register_exception, register_emergency from invenio.shellutils import run_shell_command CFG_VALID_STATUS = ('WAITING', 'SCHEDULED', 'RUNNING', 'CONTINUING', '% DELETED', 'ABOUT TO STOP', 'ABOUT TO SLEEP', 'STOPPED', 'SLEEPING', 'KILLED', 'NOW STOP', 'ERRORS REPORTED') CFG_MOTD_PATH = os.path.join(CFG_PREFIX, "var", "run", "bibsched.motd") SHIFT_RE = re.compile("([-\+]{0,1})([\d]+)([dhms])") class RecoverableError(StandardError): pass def get_pager(): """ Return the first available pager. """ paths = ( os.environ.get('PAGER', ''), CFG_BIBSCHED_LOG_PAGER, '/usr/bin/less', '/bin/more' ) for pager in paths: if os.path.exists(pager): return pager def get_editor(): """ Return the first available editor. """ paths = ( os.environ.get('EDITOR', ''), CFG_BIBSCHED_EDITOR, '/usr/bin/vim', '/usr/bin/emacs', '/usr/bin/vi', '/usr/bin/nano', ) for editor in paths: if os.path.exists(editor): return editor def get_datetime(var, format_string="%Y-%m-%d %H:%M:%S"): """Returns a date string according to the format string. 
It can handle normal date strings and shifts with respect to now.""" try: date = time.time() factors = {"d": 24*3600, "h": 3600, "m": 60, "s": 1} m = SHIFT_RE.match(var) if m: sign = m.groups()[0] == "-" and -1 or 1 factor = factors[m.groups()[2]] value = float(m.groups()[1]) date = time.localtime(date + sign * factor * value) date = time.strftime(format_string, date) else: date = time.strptime(var, format_string) date = time.strftime(format_string, date) return date except: return None def get_my_pid(process, args=''): if sys.platform.startswith('freebsd'): command = "ps -o pid,args | grep '%s %s' | grep -v 'grep' | sed -n 1p" % (process, args) else: command = "ps -C %s o '%%p%%a' | grep '%s %s' | grep -v 'grep' | sed -n 1p" % (process, process, args) answer = run_shell_command(command)[1].strip() if answer == '': answer = 0 else: answer = answer[:answer.find(' ')] return int(answer) def get_task_pid(task_name, task_id, ignore_error=False): """Return the pid of task_name/task_id""" try: path = os.path.join(CFG_PREFIX, 'var', 'run', 'bibsched_task_%d.pid' % task_id) pid = int(open(path).read()) os.kill(pid, signal.SIGUSR2) return pid except (OSError, IOError): if ignore_error: return 0 register_exception() return get_my_pid(task_name, str(task_id)) def get_last_taskid(): """Return the last taskid used.""" return run_sql("SELECT MAX(id) FROM schTASK")[0][0] def delete_task(task_id): """Delete the corresponding task.""" run_sql("DELETE FROM schTASK WHERE id=%s", (task_id, )) def is_task_scheduled(task_name): """Check if a certain task_name is due for execution (WAITING or RUNNING)""" sql = """SELECT COUNT(proc) FROM schTASK WHERE proc = %s AND (status='WAITING' OR status='RUNNING')""" return run_sql(sql, (task_name,))[0][0] > 0 def get_task_ids_by_descending_date(task_name, statuses=['SCHEDULED']): """Returns list of task ids, ordered by descending runtime.""" sql = """SELECT id FROM schTASK WHERE proc=%s AND (%s) ORDER BY runtime DESC""" \ % " OR ".join(["status = '%s'" % x for x in statuses]) return [x[0] for x in run_sql(sql, (task_name,))] def get_task_options(task_id): """Returns options for task_id read from the BibSched task queue table.""" res = run_sql("SELECT arguments FROM schTASK WHERE id=%s", (task_id,)) try: return marshal.loads(res[0][0]) except IndexError: return list() def gc_tasks(verbose=False, statuses=None, since=None, tasks=None): # pylint: disable=W0613 """Garbage collect the task queue.""" if tasks is None: tasks = CFG_BIBSCHED_GC_TASKS_TO_REMOVE + CFG_BIBSCHED_GC_TASKS_TO_ARCHIVE if since is None: since = '-%id' % CFG_BIBSCHED_GC_TASKS_OLDER_THAN if statuses is None: statuses = ['DONE'] statuses = [status.upper() for status in statuses if status.upper() != 'RUNNING'] date = get_datetime(since) status_query = 'status in (%s)' % ','.join([repr(real_escape_string(status)) for status in statuses]) for task in tasks: if task in CFG_BIBSCHED_GC_TASKS_TO_REMOVE: res = run_sql("""DELETE FROM schTASK WHERE proc=%%s AND %s AND runtime<%%s""" % status_query, (task, date)) write_message('Deleted %s %s tasks (created before %s) with %s' % (res, task, date, status_query)) elif task in CFG_BIBSCHED_GC_TASKS_TO_ARCHIVE: run_sql("""INSERT INTO hstTASK(id,proc,host,user, runtime,sleeptime,arguments,status,progress) SELECT id,proc,host,user, runtime,sleeptime,arguments,status,progress FROM schTASK WHERE proc=%%s AND %s AND runtime<%%s""" % status_query, (task, date)) res = run_sql("""DELETE FROM schTASK WHERE proc=%%s AND %s AND runtime<%%s""" % status_query, (task, date)) 
write_message('Archived %s %s tasks (created before %s) with %s' % (res, task, date, status_query)) def spawn_task(command, wait=False): """ Spawn the provided command in a way that is detached from the current group. In this way a signal received by bibsched is not going to be automatically propagated to the spawned process. """ def preexec(): # Don't forward signals. os.setsid() devnull = open(os.devnull, "w") process = Popen(command, preexec_fn=preexec, shell=True, stderr=devnull, stdout=devnull) if wait: process.wait() def bibsched_get_host(task_id): """Retrieve the hostname of the task.""" res = run_sql("SELECT host FROM schTASK WHERE id=%s LIMIT 1", (task_id, ), 1) if res: return res[0][0] def bibsched_set_host(task_id, host=""): """Update the host of task_id.""" return run_sql("UPDATE schTASK SET host=%s WHERE id=%s", (host, task_id)) def bibsched_get_status(task_id): """Retrieve the task status.""" res = run_sql("SELECT status FROM schTASK WHERE id=%s LIMIT 1", (task_id, ), 1) if res: return res[0][0] def bibsched_set_status(task_id, status, when_status_is=None): """Update the status of task_id.""" if when_status_is is None: return run_sql("UPDATE schTASK SET status=%s WHERE id=%s", (status, task_id)) else: return run_sql("UPDATE schTASK SET status=%s WHERE id=%s AND status=%s", (status, task_id, when_status_is)) def bibsched_set_progress(task_id, progress): """Update the progress of task_id.""" return run_sql("UPDATE schTASK SET progress=%s WHERE id=%s", (progress, task_id)) def bibsched_set_priority(task_id, priority): """Update the priority of task_id.""" return run_sql("UPDATE schTASK SET priority=%s WHERE id=%s", (priority, task_id)) def bibsched_send_signal(proc, task_id, sig): """Send a signal to a given task.""" if bibsched_get_host(task_id) != gethostname(): return False pid = get_task_pid(proc, task_id, True) if pid: try: os.kill(pid, sig) return True except OSError: return False return False def is_monotask(task_id, proc, runtime, status, priority, host, sequenceid): # pylint: disable=W0613 procname = proc.split(':')[0] return procname in CFG_BIBTASK_MONOTASKS def stop_task(other_task_id, other_proc, other_priority, other_status, other_sequenceid): # pylint: disable=W0613 Log("Send STOP signal to #%d (%s) which was in status %s" % (other_task_id, other_proc, other_status)) bibsched_set_status(other_task_id, 'ABOUT TO STOP', other_status) def sleep_task(other_task_id, other_proc, other_priority, other_status, other_sequenceid): # pylint: disable=W0613 Log("Send SLEEP signal to #%d (%s) which was in status %s" % (other_task_id, other_proc, other_status)) bibsched_set_status(other_task_id, 'ABOUT TO SLEEP', other_status) class Manager(object): def __init__(self, old_stdout): import curses import curses.panel from curses.wrapper import wrapper self.old_stdout = old_stdout self.curses = curses self.helper_modules = CFG_BIBTASK_VALID_TASKS self.running = 1 self.footer_auto_mode = "Automatic Mode [A Manual] [1/2/3 Display] [P Purge] [l/L Log] [O Opts] [E Edit motd] [Q Quit]" self.footer_select_mode = "Manual Mode [A Automatic] [1/2/3 Display Type] [P Purge] [l/L Log] [O Opts] [E Edit motd] [Q Quit]" self.footer_waiting_item = "[R Run] [D Delete] [N Priority]" self.footer_running_item = "[S Sleep] [T Stop] [K Kill]" self.footer_stopped_item = "[I Initialise] [D Delete] [K Acknowledge]" self.footer_sleeping_item = "[W Wake Up] [T Stop] [K Kill]" self.item_status = "" self.rows = [] self.panel = None self.display = 2 self.first_visible_line = 0 self.auto_mode = 0
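# self.display selects the queue view loaded by update_rows() and is toggled
# with keys '1'/'2'/'3' in handle_keys():
#   1 - finished tasks (status DONE or ACK%) from schTASK, newest first
#   2 - pending/active tasks (not DONE/ACK) from schTASK, oldest first (default)
#   3 - archived tasks from the hstTASK table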
self.currentrow = None self.current_attr = 0 self.hostname = gethostname() self.allowed_task_types = CFG_BIBSCHED_NODE_TASKS.get(self.hostname, CFG_BIBTASK_VALID_TASKS) self.motd = "" self.header_lines = 2 self.read_motd() self.selected_line = self.header_lines wrapper(self.start) def read_motd(self): """Get a fresh motd from disk, if it exists.""" self.motd = "" self.header_lines = 2 try: if os.path.exists(CFG_MOTD_PATH): motd = open(CFG_MOTD_PATH).read().strip() if motd: self.motd = "MOTD [%s] " % time.strftime("%Y-%m-%d %H:%M", time.localtime(os.path.getmtime(CFG_MOTD_PATH))) + motd self.header_lines = 3 except IOError: pass def handle_keys(self, char): if char == -1: return if self.auto_mode and (char not in (self.curses.KEY_UP, self.curses.KEY_DOWN, self.curses.KEY_PPAGE, self.curses.KEY_NPAGE, ord("g"), ord("G"), ord("n"), ord("q"), ord("Q"), ord("a"), ord("A"), ord("1"), ord("2"), ord("3"), ord("p"), ord("P"), ord("o"), ord("O"), ord("l"), ord("L"), ord("e"), ord("E"))): self.display_in_footer("in automatic mode") else: status = self.currentrow and self.currentrow[5] or None if char == self.curses.KEY_UP: self.selected_line = max(self.selected_line - 1, self.header_lines) self.repaint() if char == self.curses.KEY_PPAGE: self.selected_line = max(self.selected_line - 10, self.header_lines) self.repaint() elif char == self.curses.KEY_DOWN: self.selected_line = min(self.selected_line + 1, len(self.rows) + self.header_lines - 1) self.repaint() elif char == self.curses.KEY_NPAGE: self.selected_line = min(self.selected_line + 10, len(self.rows) + self.header_lines - 1) self.repaint() elif char == self.curses.KEY_HOME: self.first_visible_line = 0 self.selected_line = self.header_lines elif char == ord("g"): self.selected_line = self.header_lines self.repaint() elif char == ord("G"): self.selected_line = len(self.rows) + self.header_lines - 1 self.repaint() elif char in (ord("a"), ord("A")): self.change_auto_mode() elif char == ord("l"): self.openlog() elif char == ord("L"): self.openlog(err=True) elif char in (ord("w"), ord("W")): self.wakeup() elif char in (ord("n"), ord("N")): self.change_priority() elif char in (ord("r"), ord("R")): if status in ('WAITING', 'SCHEDULED'): self.run() elif char in (ord("s"), ord("S")): self.sleep() elif char in (ord("k"), ord("K")): if status in ('ERROR', 'DONE WITH ERRORS', 'ERRORS REPORTED'): self.acknowledge() elif status is not None: self.kill() elif char in (ord("t"), ord("T")): self.stop() elif char in (ord("d"), ord("D")): self.delete() elif char in (ord("i"), ord("I")): self.init() elif char in (ord("p"), ord("P")): self.purge_done() elif char in (ord("o"), ord("O")): self.display_task_options() elif char in (ord("e"), ord("E")): self.edit_motd() self.read_motd() elif char == ord("1"): self.display = 1 self.first_visible_line = 0 self.selected_line = self.header_lines # We need to update the display to display done tasks self.update_rows() self.repaint() self.display_in_footer("only done processes are displayed") elif char == ord("2"): self.display = 2 self.first_visible_line = 0 self.selected_line = self.header_lines # We need to update the display to display not done tasks self.update_rows() self.repaint() self.display_in_footer("only not done processes are displayed") elif char == ord("3"): self.display = 3 self.first_visible_line = 0 self.selected_line = self.header_lines # We need to update the display to display archived tasks self.update_rows() self.repaint() self.display_in_footer("only archived processes are displayed") elif char in 
(ord("q"), ord("Q")): if self.curses.panel.top_panel() == self.panel: self.panel = None self.curses.panel.update_panels() else: self.running = 0 return def openlog(self, err=False): task_id = self.currentrow[0] if err: logname = os.path.join(CFG_LOGDIR, 'bibsched_task_%d.err' % task_id) else: logname = os.path.join(CFG_LOGDIR, 'bibsched_task_%d.log' % task_id) if os.path.exists(logname): pager = get_pager() if os.path.exists(pager): self.curses.endwin() os.system('%s %s' % (pager, logname)) print >> self.old_stdout, "\rPress ENTER to continue", self.old_stdout.flush() raw_input() # We need to redraw the bibsched task list # since we are displaying "Press ENTER to continue" self.repaint() else: self._display_message_box("No pager was found") def edit_motd(self): """Add, delete or change the motd message that will be shown when the bibsched monitor starts.""" editor = get_editor() if editor: previous = self.motd self.curses.endwin() os.system("%s %s" % (editor, CFG_MOTD_PATH)) # We need to redraw the MOTD part self.read_motd() self.repaint() if previous[24:] != self.motd[24:]: if len(previous) == 0: Log('motd set to "%s"' % self.motd.replace("\n", "|")) self.selected_line += 1 self.header_lines += 1 elif len(self.motd) == 0: Log('motd deleted') self.selected_line -= 1 self.header_lines -= 1 else: Log('motd changed to "%s"' % self.motd.replace("\n", "|")) else: self._display_message_box("No editor was found") def display_task_options(self): """Nicely display information about current process.""" msg = ' id: %i\n\n' % self.currentrow[0] pid = get_task_pid(self.currentrow[1], self.currentrow[0], True) if pid is not None: msg += ' pid: %s\n\n' % pid msg += ' priority: %s\n\n' % self.currentrow[8] msg += ' proc: %s\n\n' % self.currentrow[1] msg += ' user: %s\n\n' % self.currentrow[2] msg += ' runtime: %s\n\n' % self.currentrow[3].strftime("%Y-%m-%d %H:%M:%S") msg += ' sleeptime: %s\n\n' % self.currentrow[4] msg += ' status: %s\n\n' % self.currentrow[5] msg += ' progress: %s\n\n' % self.currentrow[6] arguments = marshal.loads(self.currentrow[7]) if type(arguments) is dict: # FIXME: REMOVE AFTER MAJOR RELEASE 1.0 msg += ' options : %s\n\n' % arguments else: msg += 'executable : %s\n\n' % arguments[0] msg += ' arguments : %s\n\n' % ' '.join(arguments[1:]) msg += '\n\nPress q to quit this panel...' msg = wrap_text_in_a_box(msg, style='no_border') rows = msg.split('\n') height = len(rows) + 2 width = max([len(row) for row in rows]) + 4 try: self.win = self.curses.newwin( height, width, (self.height - height) / 2 + 1, (self.width - width) / 2 + 1 ) except self.curses.error: return self.panel = self.curses.panel.new_panel(self.win) self.panel.top() self.win.border() i = 1 for row in rows: self.win.addstr(i, 2, row, self.current_attr) i += 1 self.win.refresh() while self.win.getkey() != 'q': pass self.panel = None def count_processes(self, status): out = 0 res = run_sql("""SELECT COUNT(id) FROM schTASK WHERE status=%s GROUP BY status""", (status,)) try: out = res[0][0] except: pass return out def change_priority(self): task_id = self.currentrow[0] priority = self.currentrow[8] new_priority = self._display_ask_number_box("Insert the desired \ priority for task %s. The smaller the number the less the priority. Note that \ a number less than -10 will mean to always postpone the task while a number \ bigger than 10 will mean some tasks with less priority could be stopped in \ order to let this task run. The current priority is %s. 
New value:" % (task_id, priority)) try: new_priority = int(new_priority) except ValueError: return bibsched_set_priority(task_id, new_priority) # We need to update the tasks list with our new priority # to be able to display it self.update_rows() # We need to update the priority number next to the task self.repaint() def wakeup(self): task_id = self.currentrow[0] process = self.currentrow[1] status = self.currentrow[5] #if self.count_processes('RUNNING') + self.count_processes('CONTINUING') >= 1: #self.display_in_footer("a process is already running!") if status == "SLEEPING": if not bibsched_send_signal(process, task_id, signal.SIGCONT): bibsched_set_status(task_id, "ERROR", "SLEEPING") self.update_rows() self.repaint() self.display_in_footer("process woken up") else: self.display_in_footer("process is not sleeping") self.stdscr.refresh() def _display_YN_box(self, msg): """Utility to display confirmation boxes.""" msg += ' (Y/N)' msg = wrap_text_in_a_box(msg, style='no_border') rows = msg.split('\n') height = len(rows) + 2 width = max([len(row) for row in rows]) + 4 self.win = self.curses.newwin( height, width, (self.height - height) / 2 + 1, (self.width - width) / 2 + 1 ) self.panel = self.curses.panel.new_panel(self.win) self.panel.top() self.win.border() i = 1 for row in rows: self.win.addstr(i, 2, row, self.current_attr) i += 1 self.win.refresh() try: while 1: c = self.win.getch() if c in (ord('y'), ord('Y')): return True elif c in (ord('n'), ord('N')): return False finally: self.panel = None def _display_ask_number_box(self, msg): """Utility to display confirmation boxes.""" msg = wrap_text_in_a_box(msg, style='no_border') rows = msg.split('\n') height = len(rows) + 3 width = max([len(row) for row in rows]) + 4 self.win = self.curses.newwin( height, width, (self.height - height) / 2 + 1, (self.width - width) / 2 + 1 ) self.panel = self.curses.panel.new_panel(self.win) self.panel.top() self.win.border() i = 1 for row in rows: self.win.addstr(i, 2, row, self.current_attr) i += 1 self.win.refresh() self.win.move(height - 2, 2) self.curses.echo() ret = self.win.getstr() self.curses.noecho() self.panel = None return ret def _display_message_box(self, msg): """Utility to display message boxes.""" rows = msg.split('\n') height = len(rows) + 2 width = max([len(row) for row in rows]) + 3 self.win = self.curses.newwin( height, width, (self.height - height) / 2 + 1, (self.width - width) / 2 + 1 ) self.panel = self.curses.panel.new_panel(self.win) self.panel.top() self.win.border() i = 1 for row in rows: self.win.addstr(i, 2, row, self.current_attr) i += 1 self.win.refresh() self.win.move(height - 2, 2) self.win.getkey() self.curses.noecho() self.panel = None def purge_done(self): """Garbage collector.""" if self._display_YN_box( "You are going to purge the list of DONE tasks.\n\n" "%s tasks, submitted since %s days, will be archived.\n\n" "%s tasks, submitted since %s days, will be deleted.\n\n" "Are you sure?" 
% ( ', '.join(CFG_BIBSCHED_GC_TASKS_TO_ARCHIVE), CFG_BIBSCHED_GC_TASKS_OLDER_THAN, ', '.join(CFG_BIBSCHED_GC_TASKS_TO_REMOVE), CFG_BIBSCHED_GC_TASKS_OLDER_THAN)): gc_tasks() # We removed some tasks from our list self.update_rows() self.repaint() self.display_in_footer("DONE processes purged") def run(self): task_id = self.currentrow[0] process = self.currentrow[1].split(':')[0] status = self.currentrow[5] if status == "WAITING": if process in self.helper_modules: if run_sql("""UPDATE schTASK SET status='SCHEDULED', host=%s WHERE id=%s and status='WAITING'""", (self.hostname, task_id)): program = os.path.join(CFG_BINDIR, process) command = "%s %s" % (program, str(task_id)) spawn_task(command) Log("manually running task #%d (%s)" % (task_id, process)) # We changed the status of one of our tasks self.update_rows() self.repaint() else: ## Process already running (typing too quickly on the keyboard?) pass else: self.display_in_footer("Process %s is not in the list of allowed processes." % process) else: self.display_in_footer("Process status should be SCHEDULED or WAITING!") def acknowledge(self): task_id = self.currentrow[0] status = self.currentrow[5] if status in ('ERROR', 'DONE WITH ERRORS', 'ERRORS REPORTED'): bibsched_set_status(task_id, 'ACK ' + status, status) self.update_rows() self.repaint() self.display_in_footer("Acknowledged error") def sleep(self): task_id = self.currentrow[0] status = self.currentrow[5] if status in ('RUNNING', 'CONTINUING'): bibsched_set_status(task_id, 'ABOUT TO SLEEP', status) self.update_rows() self.repaint() self.display_in_footer("SLEEP signal sent to task #%s" % task_id) else: self.display_in_footer("Cannot put to sleep non-running processes") def kill(self): task_id = self.currentrow[0] process = self.currentrow[1] status = self.currentrow[5] if status in ('RUNNING', 'CONTINUING', 'ABOUT TO STOP', 'ABOUT TO SLEEP', 'SLEEPING'): if self._display_YN_box("Are you sure you want to kill the %s process %s?" 
% (process, task_id)): bibsched_send_signal(process, task_id, signal.SIGKILL) bibsched_set_status(task_id, 'KILLED') self.update_rows() self.repaint() self.display_in_footer("KILL signal sent to task #%s" % task_id) else: self.display_in_footer("Cannot kill non-running processes") def stop(self): task_id = self.currentrow[0] process = self.currentrow[1] status = self.currentrow[5] if status in ('RUNNING', 'CONTINUING', 'ABOUT TO SLEEP', 'SLEEPING'): if status == 'SLEEPING': bibsched_set_status(task_id, 'NOW STOP', 'SLEEPING') bibsched_send_signal(process, task_id, signal.SIGCONT) count = 10 while bibsched_get_status(task_id) == 'NOW STOP': if count <= 0: bibsched_set_status(task_id, 'ERROR', 'NOW STOP') self.update_rows() self.repaint() self.display_in_footer("It seems impossible to wake up this task.") return time.sleep(CFG_BIBSCHED_REFRESHTIME) count -= 1 else: bibsched_set_status(task_id, 'ABOUT TO STOP', status) self.update_rows() self.repaint() self.display_in_footer("STOP signal sent to task #%s" % task_id) else: self.display_in_footer("Cannot stop non-running processes") def delete(self): task_id = self.currentrow[0] status = self.currentrow[5] if status not in ('RUNNING', 'CONTINUING', 'SLEEPING', 'SCHEDULED', 'ABOUT TO STOP', 'ABOUT TO SLEEP'): bibsched_set_status(task_id, "%s_DELETED" % status, status) self.display_in_footer("process deleted") self.update_rows() self.repaint() else: self.display_in_footer("Cannot delete running processes") def init(self): task_id = self.currentrow[0] status = self.currentrow[5] if status not in ('RUNNING', 'CONTINUING', 'SLEEPING'): bibsched_set_status(task_id, "WAITING") bibsched_set_progress(task_id, "") bibsched_set_host(task_id, "") self.update_rows() self.repaint() self.display_in_footer("process initialised") else: self.display_in_footer("Cannot initialise running processes") def change_auto_mode(self): program = os.path.join(CFG_BINDIR, "bibsched") if self.auto_mode: COMMAND = "%s -q halt" % program else: COMMAND = "%s -q start" % program os.system(COMMAND) self.auto_mode = not self.auto_mode # We need to refresh the color of the header and footer self.repaint() def put_line(self, row, header=False, motd=False): ## ROW: (id,proc,user,runtime,sleeptime,status,progress,arguments,priority,host) ## 0 1 2 3 4 5 6 7 8 9 - col_w = [7 , 25, 15, 21, 7, 11, 21, 60] + col_w = [8 , 25, 15, 21, 7, 12, 21, 60] maxx = self.width if self.y == self.selected_line - self.first_visible_line and self.y > 1: self.item_status = row[5] self.currentrow = row if motd: attr = self.curses.color_pair(1) + self.curses.A_BOLD elif self.y == self.header_lines - 2: if self.auto_mode: attr = self.curses.color_pair(2) + self.curses.A_STANDOUT + self.curses.A_BOLD else: attr = self.curses.color_pair(8) + self.curses.A_STANDOUT + self.curses.A_BOLD elif row[5] == "DONE": attr = self.curses.color_pair(5) + self.curses.A_BOLD elif row[5] == "STOPPED": attr = self.curses.color_pair(6) + self.curses.A_BOLD elif row[5].find("ERROR") > -1: attr = self.curses.color_pair(4) + self.curses.A_BOLD elif row[5] == "WAITING": attr = self.curses.color_pair(3) + self.curses.A_BOLD elif row[5] in ("RUNNING", "CONTINUING"): attr = self.curses.color_pair(2) + self.curses.A_BOLD elif not header and row[8]: attr = self.curses.A_BOLD else: attr = self.curses.A_NORMAL ## If the task is not relevant for this instance of BibSched because ## the type of the task cannot be run, or it is running on another ## machine: make it a different color if not header and (row[1].split(':')[0] not in
self.allowed_task_types or (row[9] != '' and row[9] != self.hostname)): attr = self.curses.color_pair(6) if not row[6]: nrow = list(row) nrow[6] = 'Not allowed on this instance' row = tuple(nrow) if self.y == self.selected_line - self.first_visible_line and self.y > 1: self.current_attr = attr attr += self.curses.A_REVERSE if header: # Dirty hack. put_line should be better refactored. # row contains one less element: arguments ## !!! FIXME: THIS IS CRAP myline = str(row[0]).ljust(col_w[0]-1) myline += str(row[1]).ljust(col_w[1]-1) myline += str(row[2]).ljust(col_w[2]-1) myline += str(row[3]).ljust(col_w[3]-1) myline += str(row[4]).ljust(col_w[4]-1) myline += str(row[5]).ljust(col_w[5]-1) myline += str(row[6]).ljust(col_w[6]-1) myline += str(row[7]).ljust(col_w[7]-1) elif motd: myline = str(row[0]) else: ## ROW: (id,proc,user,runtime,sleeptime,status,progress,arguments,priority,host) ## 0 1 2 3 4 5 6 7 8 9 priority = str(row[8] and ' [%s]' % row[8] or '') myline = str(row[0]).ljust(col_w[0])[:col_w[0]-1] myline += (str(row[1])[:col_w[1]-len(priority)-2] + priority).ljust(col_w[1]-1) myline += str(row[2]).ljust(col_w[2])[:col_w[2]-1] myline += str(row[3]).ljust(col_w[3])[:col_w[3]-1] myline += str(row[4]).ljust(col_w[4])[:col_w[4]-1] myline += str(row[5]).ljust(col_w[5])[:col_w[5]-1] myline += str(row[9]).ljust(col_w[6])[:col_w[6]-1] myline += str(row[6]).ljust(col_w[7])[:col_w[7]-1] myline = myline.ljust(maxx) try: self.stdscr.addnstr(self.y, 0, myline, maxx, attr) except self.curses.error: pass self.y += 1 def display_in_footer(self, footer, i=0, print_time_p=0): if print_time_p: footer = "%s %s" % (footer, time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())) maxx = self.stdscr.getmaxyx()[1] footer = footer.ljust(maxx) if self.auto_mode: colorpair = 2 else: colorpair = 1 try: self.stdscr.addnstr(self.y - i, 0, footer, maxx - 1, self.curses.A_STANDOUT + self.curses.color_pair(colorpair) + self.curses.A_BOLD) except self.curses.error: pass def repaint(self): if server_pid(): self.auto_mode = 1 else: if self.auto_mode == 1: self.curses.beep() self.auto_mode = 0 self.y = 0 self.stdscr.erase() self.height, self.width = self.stdscr.getmaxyx() maxy = self.height - 2 #maxx = self.width if len(self.motd) > 0: self.put_line((self.motd.strip().replace("\n", " - ")[:79], "", "", "", "", "", "", "", ""), header=False, motd=True) self.put_line(("ID", "PROC [PRI]", "USER", "RUNTIME", "SLEEP", "STATUS", "HOST", "PROGRESS"), header=True) self.put_line(("", "", "", "", "", "", "", ""), header=True) if self.selected_line > maxy + self.first_visible_line - 1: self.first_visible_line = self.selected_line - maxy + 1 if self.selected_line < self.first_visible_line + 2: self.first_visible_line = self.selected_line - 2 for row in self.rows[self.first_visible_line:self.first_visible_line+maxy-2]: self.put_line(row) self.y = self.stdscr.getmaxyx()[0] - 1 if self.auto_mode: self.display_in_footer(self.footer_auto_mode, print_time_p=1) else: self.display_in_footer(self.footer_select_mode, print_time_p=1) footer2 = "" if self.item_status.find("DONE") > -1 or self.item_status in ("ERROR", "STOPPED", "KILLED", "ERRORS REPORTED"): footer2 += self.footer_stopped_item elif self.item_status in ("RUNNING", "CONTINUING", "ABOUT TO STOP", "ABOUT TO SLEEP"): footer2 += self.footer_running_item elif self.item_status == "SLEEPING": footer2 += self.footer_sleeping_item elif self.item_status == "WAITING": footer2 += self.footer_waiting_item self.display_in_footer(footer2, 1) self.stdscr.refresh() def update_rows(self): if 
self.display == 1: table = "schTASK" where = "and (status='DONE' or status LIKE 'ACK%')" order = "runtime DESC" limit = "" elif self.display == 2: table = "schTASK" where = "and (status<>'DONE' and status NOT LIKE 'ACK%')" order = "runtime ASC" limit = "limit %s" % CFG_BIBSCHED_MAX_ARCHIVED_ROWS_DISPLAY else: table = "hstTASK" order = "runtime DESC" where = "" limit = "" self.rows = run_sql("""SELECT id, proc, user, runtime, sleeptime, status, progress, arguments, priority, host, sequenceid FROM %s WHERE status NOT LIKE '%%_DELETED' %s ORDER BY %s %s""" % (table, where, order, limit)) # Make sure we are not selecting a line that disappeared self.selected_line = min(self.selected_line, len(self.rows) + self.header_lines - 1) def start(self, stdscr): os.environ['BIBSCHED_MODE'] = 'manual' if self.curses.has_colors(): self.curses.start_color() self.curses.init_pair(8, self.curses.COLOR_WHITE, self.curses.COLOR_BLACK) self.curses.init_pair(1, self.curses.COLOR_WHITE, self.curses.COLOR_RED) self.curses.init_pair(2, self.curses.COLOR_GREEN, self.curses.COLOR_BLACK) self.curses.init_pair(3, self.curses.COLOR_MAGENTA, self.curses.COLOR_BLACK) self.curses.init_pair(4, self.curses.COLOR_RED, self.curses.COLOR_BLACK) self.curses.init_pair(5, self.curses.COLOR_BLUE, self.curses.COLOR_BLACK) self.curses.init_pair(6, self.curses.COLOR_CYAN, self.curses.COLOR_BLACK) self.curses.init_pair(7, self.curses.COLOR_YELLOW, self.curses.COLOR_BLACK) self.stdscr = stdscr self.base_panel = self.curses.panel.new_panel(self.stdscr) self.base_panel.bottom() self.curses.panel.update_panels() self.height, self.width = stdscr.getmaxyx() self.stdscr.erase() if server_pid(): self.auto_mode = 1 ring = 4 if len(self.motd) > 0: self._display_message_box(self.motd + "\nPress any key to close") while self.running: if ring == 4: self.read_motd() self.update_rows() ring = 0 self.repaint() ring += 1 char = -1 try: char = timed_out(self.stdscr.getch, 1) if char == 27: # escaping sequence char = self.stdscr.getch() if char == 79: # arrow char = self.stdscr.getch() if char == 65: # arrow up char = self.curses.KEY_UP elif char == 66: # arrow down char = self.curses.KEY_DOWN elif char == 72: char = self.curses.KEY_PPAGE elif char == 70: char = self.curses.KEY_NPAGE elif char == 91: char = self.stdscr.getch() if char == 53: char = self.stdscr.getch() if char == 126: char = self.curses.KEY_HOME except TimedOutExc: char = -1 self.handle_keys(char) class BibSched(object): def __init__(self, debug=False): self.debug = debug self.hostname = gethostname() self.helper_modules = CFG_BIBTASK_VALID_TASKS ## All the tasks in the queue that the node is allowed to manipulate self.node_relevant_bibupload_tasks = () self.node_relevant_waiting_tasks = () self.node_relevant_active_tasks = () ## All tasks of all nodes self.active_tasks_all_nodes = () self.mono_tasks_all_nodes = () self.allowed_task_types = CFG_BIBSCHED_NODE_TASKS.get(self.hostname, CFG_BIBTASK_VALID_TASKS) os.environ['BIBSCHED_MODE'] = 'automatic' def tie_task_to_host(self, task_id): """Sets the hostname of a task to the machine executing this script @return: True if the scheduling was successful, False otherwise, e.g. if the task was scheduled concurrently on a different host. """ if not run_sql("""SELECT id FROM schTASK WHERE id=%s AND host='' AND status='WAITING'""", (task_id, )): ## The task was already tied? 
return False run_sql("""UPDATE schTASK SET host=%s, status='SCHEDULED' WHERE id=%s AND host='' AND status='WAITING'""", (self.hostname, task_id)) return bool(run_sql("SELECT id FROM schTASK WHERE id=%s AND host=%s", (task_id, self.hostname))) def filter_for_allowed_tasks(self): """ Removes all tasks that are not allowed in this Invenio instance """ def relevant_task(task_id, proc, runtime, status, priority, host, sequenceid): # pylint: disable=W0613 # if host and self.hostname != host: # return False procname = proc.split(':')[0] if procname not in self.allowed_task_types: return False return True def filter_tasks(tasks): return tuple(t for t in tasks if relevant_task(*t)) self.node_relevant_bibupload_tasks = filter_tasks(self.node_relevant_bibupload_tasks) self.node_relevant_active_tasks = filter_tasks(self.node_relevant_active_tasks) self.node_relevant_waiting_tasks = filter_tasks(self.node_relevant_waiting_tasks) self.node_relevant_sleeping_tasks = filter_tasks(self.node_relevant_sleeping_tasks) def is_task_safe_to_execute(self, proc1, proc2): """Return True when the two tasks can run concurrently.""" return proc1 != proc2 # and not proc1.startswith('bibupload') and not proc2.startswith('bibupload') def get_tasks_to_sleep_and_stop(self, proc, task_set): """Among the task_set, return the list of tasks to stop and the list of tasks to sleep. """ if proc in CFG_BIBTASK_MONOTASKS: return [], [t for t in task_set if t[3] not in ('SLEEPING', 'ABOUT TO SLEEP')] min_prio = None min_task_id = None min_proc = None min_status = None min_sequenceid = None to_stop = [] ## For all the lower priority tasks... for (this_task_id, this_proc, this_priority, this_status, this_sequenceid) in task_set: if not self.is_task_safe_to_execute(this_proc, proc): to_stop.append((this_task_id, this_proc, this_priority, this_status, this_sequenceid)) elif (min_prio is None or this_priority < min_prio) and \ this_status not in ('SLEEPING', 'ABOUT TO SLEEP'): ## We don't put to sleep already sleeping task :-) min_prio = this_priority min_task_id = this_task_id min_proc = this_proc min_status = this_status min_sequenceid = this_sequenceid if to_stop: return to_stop, [] elif min_task_id: return [], [(min_task_id, min_proc, min_prio, min_status, min_sequenceid)] else: return [], [] def split_active_tasks_by_priority(self, task_id, priority): """Return two lists: the list of task_ids with lower priority and those with higher or equal priority.""" higher = [] lower = [] ### !!! We already have this in node_relevant_active_tasks for other_task_id, task_proc, dummy, status, task_priority, task_host, sequenceid in self.node_relevant_active_tasks: # for other_task_id, task_proc, runtime, status, task_priority, task_host in self.node_relevant_active_tasks: # for other_task_id, task_proc, task_priority, status in self.get_running_tasks(): if task_id == other_task_id: continue if task_priority < priority and task_host == self.hostname: lower.append((other_task_id, task_proc, task_priority, status, sequenceid)) elif task_host == self.hostname: higher.append((other_task_id, task_proc, task_priority, status, sequenceid)) return lower, higher def handle_task(self, task_id, proc, runtime, status, priority, host, sequenceid): """Perform needed action of the row representing a task. 
Return True when task_status need to be refreshed""" debug = self.debug if debug: Log("task_id: %s, proc: %s, runtime: %s, status: %s, priority: %s, host: %s, sequenceid: %s" % (task_id, proc, runtime, status, priority, host, sequenceid)) if (task_id, proc, runtime, status, priority, host, sequenceid) in self.node_relevant_active_tasks: # For multi-node # check if we need to sleep ourselves for monotasks to be able to run for other_task_id, other_proc, dummy_other_runtime, other_status, other_priority, other_host, other_sequenceid in self.mono_tasks_all_nodes: if priority < other_priority: # Sleep ourselves if status not in ('SLEEPING', 'ABOUT TO SLEEP'): sleep_task(task_id, proc, priority, status, sequenceid) return True return False elif (task_id, proc, runtime, status, priority, host, sequenceid) in self.node_relevant_waiting_tasks: if debug: Log("Trying to run %s" % task_id) if priority < -10: if debug: Log("Cannot run because priority < -10") return False lower, higher = self.split_active_tasks_by_priority(task_id, priority) if debug: Log('lower: %s' % lower) Log('higher: %s' % higher) for other_task_id, other_proc, dummy_other_runtime, other_status, \ other_priority, other_host, other_sequenceid in chain( self.node_relevant_sleeping_tasks, self.active_tasks_all_nodes): if task_id != other_task_id and \ not self.is_task_safe_to_execute(proc, other_proc): ### !!! WE NEED TO CHECK FOR TASKS THAT CAN ONLY BE EXECUTED ON ONE MACHINE AT ONE TIME ### !!! FOR EXAMPLE BIBUPLOADS WHICH NEED TO BE EXECUTED SEQUENTIALLY AND NEVER CONCURRENTLY ## There's at least a higher priority task running that ## cannot run at the same time of the given task. ## We give up if debug: Log("Cannot run because task_id: %s, proc: %s is in the queue and incompatible" % (other_task_id, other_proc)) return False if sequenceid: ## Let's normalize the prority of all tasks in a sequenceid to the ## max priority of the group max_priority = run_sql("""SELECT MAX(priority) FROM schTASK WHERE status='WAITING' AND sequenceid=%s""", (sequenceid, ))[0][0] if run_sql("""UPDATE schTASK SET priority=%s WHERE status='WAITING' AND sequenceid=%s""", (max_priority, sequenceid)): Log("Raised all waiting tasks with sequenceid " "%s to the max priority %s" % (sequenceid, max_priority)) ## Some priorities where raised return True ## Let's normalize the runtime of all tasks in a sequenceid to ## the compatible runtime. 
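                ## Illustrative example: if three WAITING tasks of the same
                ## sequenceid have runtimes 10:00, 09:50 and 10:05 (in id
                ## order), the second is bumped to 10:00 so that execution
                ## follows id order, while the third keeps its later runtime.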
current_runtimes = run_sql("""SELECT id, runtime FROM schTASK WHERE sequenceid=%s AND status='WAITING' ORDER by id""", (sequenceid, )) runtimes_adjusted = False if current_runtimes: last_runtime = current_runtimes[0][1] for the_task_id, runtime in current_runtimes: if runtime < last_runtime: run_sql("""UPDATE schTASK SET runtime=%s WHERE id=%s""", (last_runtime, the_task_id)) if debug: Log("Adjusted runtime of task_id %s to %s in order to be executed in the correct sequenceid order" % (the_task_id, last_runtime)) runtimes_adjusted = True runtime = last_runtime last_runtime = runtime if runtimes_adjusted: ## Some runtime have been adjusted return True if sequenceid is not None: for other_task_id, dummy_other_proc, dummy_other_runtime, dummy_other_status, dummy_other_priority, dummy_other_host, other_sequenceid in self.active_tasks_all_nodes: if sequenceid == other_sequenceid and task_id > other_task_id: Log('Task %s need to run after task %s since they have the same sequence id: %s' % (task_id, other_task_id, sequenceid)) ## If there is a task with same sequence number then do not run the current task return False if proc in CFG_BIBTASK_MONOTASKS and higher: ## This is a monotask if debug: Log("Cannot run because this is a monotask and there are higher priority tasks: %s" % (higher, )) return False ## No higher priority task have issue with the given task. if proc not in CFG_BIBTASK_FIXEDTIMETASKS and len(higher) >= CFG_BIBSCHED_MAX_NUMBER_CONCURRENT_TASKS: if debug: Log("Cannot run because all resources (%s) are used (%s), higher: %s" % (CFG_BIBSCHED_MAX_NUMBER_CONCURRENT_TASKS, len(higher), higher)) return False ## Check for monotasks wanting to run for other_task_id, other_proc, dummy_other_runtime, other_status, other_priority, other_host, other_sequenceid in self.mono_tasks_all_nodes: if priority < other_priority: if debug: Log("Cannot run because there is a monotask with higher priority: %s %s" % (other_task_id, other_proc)) return False ## We check if it is necessary to stop/put to sleep some lower priority ## task. tasks_to_stop, tasks_to_sleep = self.get_tasks_to_sleep_and_stop(proc, lower) if debug: Log('tasks_to_stop: %s' % tasks_to_stop) Log('tasks_to_sleep: %s' % tasks_to_sleep) if tasks_to_stop and priority < 100: ## Only tasks with priority higher than 100 have the power ## to put task to stop. 
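                ## In other words, a waiting task with priority below 100
                ## simply keeps waiting; only a task with priority >= 100 may
                ## preempt incompatible lower-priority tasks that are running.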
if debug: Log("Cannot run because there are task to stop: %s and priority < 100" % tasks_to_stop) return False procname = proc.split(':')[0] if not tasks_to_stop and (not tasks_to_sleep or (proc not in CFG_BIBTASK_MONOTASKS and len(self.node_relevant_active_tasks) < CFG_BIBSCHED_MAX_NUMBER_CONCURRENT_TASKS)): if proc in CFG_BIBTASK_MONOTASKS and self.active_tasks_all_nodes: if debug: Log("Cannot run because this is a monotask and there are other tasks running: %s" % (self.node_relevant_active_tasks, )) return False def task_in_same_host(dummy_task_id, dummy_proc, dummy_runtime, dummy_status, dummy_priority, host, dummy_sequenceid): return host == self.hostname def filter_by_host(tasks): return tuple(t for t in tasks if task_in_same_host(*t)) node_active_tasks = filter_by_host(self.node_relevant_active_tasks) if len(node_active_tasks) >= CFG_BIBSCHED_MAX_NUMBER_CONCURRENT_TASKS: if debug: Log("Cannot run because all resources (%s) are used (%s), active: %s" % (CFG_BIBSCHED_MAX_NUMBER_CONCURRENT_TASKS, len(node_active_tasks), node_active_tasks)) return False if status in ("SLEEPING", "ABOUT TO SLEEP"): if host == self.hostname: ## We can only wake up tasks that are running on our own host for other_task_id, other_proc, dummy_other_runtime, other_status, dummy_other_priority, other_host, dummy_other_sequenceid in self.node_relevant_active_tasks: ## But only if there are not other tasks still going to sleep, otherwise ## we might end up stealing the slot for an higher priority task. if other_task_id != task_id and other_status in ('ABOUT TO SLEEP', 'ABOUT TO STOP') and other_host == self.hostname: if debug: Log("Not yet waking up task #%d since there are other tasks (%s #%d) going to sleep (higher priority task incoming?)" % (task_id, other_proc, other_task_id)) return False bibsched_set_status(task_id, "CONTINUING", status) if not bibsched_send_signal(proc, task_id, signal.SIGCONT): bibsched_set_status(task_id, "ERROR", "CONTINUING") Log("Task #%d (%s) woken up but didn't existed anymore" % (task_id, proc)) return True Log("Task #%d (%s) woken up" % (task_id, proc)) return True else: return False elif procname in self.helper_modules: program = os.path.join(CFG_BINDIR, procname) ## Trick to log in bibsched.log the task exiting exit_str = '&& echo "`date "+%%Y-%%m-%%d %%H:%%M:%%S"` --> Task #%d (%s) exited" >> %s' % (task_id, proc, os.path.join(CFG_LOGDIR, 'bibsched.log')) command = "%s %s %s" % (program, str(task_id), exit_str) ### Set the task to scheduled and tie it to this host if self.tie_task_to_host(task_id): Log("Task #%d (%s) started" % (task_id, proc)) ### Relief the lock for the BibTask, it is safe now to do so spawn_task(command, wait=proc in CFG_BIBTASK_MONOTASKS) count = 10 while run_sql("""SELECT status FROM schTASK WHERE id=%s AND status='SCHEDULED'""", (task_id, )): ## Polling to wait for the task to really start, ## in order to avoid race conditions. if count <= 0: raise StandardError("Process %s (task_id: %s) was launched but seems not to be able to reach RUNNING status." % (proc, task_id)) time.sleep(CFG_BIBSCHED_REFRESHTIME) count -= 1 return True else: raise StandardError("%s is not in the allowed modules" % procname) else: ## It's not still safe to run the task. 
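                ## After the stop/sleep signals below are sent we return True,
                ## so the watch loop recomputes the queue on its next cycle and
                ## reconsiders this task once the slots have been freed.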
## We first need to stop tasks that should be stopped ## and to put to sleep tasks that should be put to sleep for t in tasks_to_stop: stop_task(*t) for t in tasks_to_sleep: sleep_task(*t) time.sleep(CFG_BIBSCHED_REFRESHTIME) return True def check_errors(self): errors = run_sql("""SELECT id,proc,status FROM schTASK WHERE status = 'ERROR' OR status = 'DONE WITH ERRORS' OR status = 'CERROR'""") if errors: error_msgs = [] error_recoverable = True for e_id, e_proc, e_status in errors: if run_sql("""UPDATE schTASK SET status='ERRORS REPORTED' WHERE id = %s AND (status='CERROR' OR status='ERROR' OR status='DONE WITH ERRORS')""", [e_id]): msg = " #%s %s -> %s" % (e_id, e_proc, e_status) error_msgs.append(msg) if e_status in ('ERROR', 'DONE WITH ERRORS'): error_recoverable = False if error_msgs: msg = "BibTask with ERRORS:\n%s" % '\n'.join(error_msgs) if error_recoverable: raise RecoverableError(msg) else: raise StandardError(msg) def calculate_rows(self): """Return all the node_relevant_active_tasks to work on.""" try: self.check_errors() except RecoverableError, msg: register_emergency('Light emergency from %s: BibTask failed: %s' % (CFG_SITE_URL, msg)) max_bibupload_priority, min_bibupload_priority = run_sql( """SELECT MAX(priority), MIN(priority) FROM schTASK WHERE status IN ('WAITING', 'RUNNING', 'SLEEPING', 'ABOUT TO STOP', 'ABOUT TO SLEEP', 'SCHEDULED', 'CONTINUING') AND proc = 'bibupload' AND runtime <= NOW()""")[0] if max_bibupload_priority > min_bibupload_priority: run_sql( """UPDATE schTASK SET priority = %s WHERE status IN ('WAITING', 'RUNNING', 'SLEEPING', 'ABOUT TO STOP', 'ABOUT TO SLEEP', 'SCHEDULED', 'CONTINUING') AND proc = 'bibupload' AND runtime <= NOW() AND priority < %s""", (max_bibupload_priority, max_bibupload_priority)) ## The bibupload tasks are sorted by id, which means by the order they were scheduled self.node_relevant_bibupload_tasks = run_sql( """SELECT id, proc, runtime, status, priority, host, sequenceid FROM schTASK WHERE status IN ('WAITING', 'SLEEPING') AND proc = 'bibupload' AND runtime <= NOW() ORDER BY id ASC LIMIT 1""", n=1) ## The other tasks are sorted by priority self.node_relevant_waiting_tasks = run_sql( """SELECT id, proc, runtime, status, priority, host, sequenceid FROM schTASK WHERE (status='WAITING' AND runtime <= NOW()) OR status = 'SLEEPING' ORDER BY priority DESC, runtime ASC, id ASC""") self.node_relevant_sleeping_tasks = run_sql( """SELECT id, proc, runtime, status, priority, host, sequenceid FROM schTASK WHERE status = 'SLEEPING' ORDER BY priority DESC, runtime ASC, id ASC""") self.node_relevant_active_tasks = run_sql( """SELECT id, proc, runtime, status, priority, host, sequenceid FROM schTASK WHERE status IN ('RUNNING', 'CONTINUING', 'SCHEDULED', 'ABOUT TO STOP', 'ABOUT TO SLEEP')""") self.active_tasks_all_nodes = tuple(self.node_relevant_active_tasks) self.mono_tasks_all_nodes = tuple(t for t in self.node_relevant_waiting_tasks if is_monotask(*t)) ## Remove tasks that can not be executed on this host self.filter_for_allowed_tasks() def watch_loop(self): ## Cleaning up scheduled task not run because of bibsched being ## interrupted in the middle. run_sql("""UPDATE schTASK SET status = 'WAITING' WHERE status = 'SCHEDULED' AND host = %s""", (self.hostname, )) try: while True: if self.debug: Log("New bibsched cycle") self.calculate_rows() ## Let's first handle running node_relevant_active_tasks. for task in self.node_relevant_active_tasks: if self.handle_task(*task): break else: # If nothing has changed we can go on to run tasks. 
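                    ## This is the `else' clause of the for-loop above: it is
                    ## reached only when no active task changed state, so it is
                    ## now safe to consider the waiting tasks.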
for task in self.node_relevant_waiting_tasks: if task[1] == 'bibupload' and self.node_relevant_bibupload_tasks: ## We switch in bibupload serial mode! ## which means we execute the first next bibupload. if self.handle_task(*self.node_relevant_bibupload_tasks[0]): ## Something has changed break elif self.handle_task(*task): ## Something has changed break else: time.sleep(CFG_BIBSCHED_REFRESHTIME) except Exception, err: register_exception(alert_admin=True) try: register_emergency('Emergency from %s: BibSched halted: %s' % (CFG_SITE_URL, err)) except NotImplementedError: pass raise class TimedOutExc(Exception): def __init__(self, value="Timed Out"): Exception.__init__(self) self.value = value def __str__(self): return repr(self.value) def timed_out(f, timeout, *args, **kwargs): def handler(signum, frame): # pylint: disable=W0613 raise TimedOutExc() old = signal.signal(signal.SIGALRM, handler) signal.alarm(timeout) try: result = f(*args, **kwargs) finally: signal.signal(signal.SIGALRM, old) signal.alarm(0) return result def Log(message): log = open(CFG_LOGDIR + "/bibsched.log", "a") log.write(time.strftime("%Y-%m-%d %H:%M:%S --> ", time.localtime())) log.write(message) log.write("\n") log.close() def redirect_stdout_and_stderr(): "This function redirects stdout and stderr to bibsched.log and bibsched.err file." old_stdout = sys.stdout old_stderr = sys.stderr sys.stdout = open(CFG_LOGDIR + "/bibsched.log", "a") sys.stderr = open(CFG_LOGDIR + "/bibsched.err", "a") return old_stdout, old_stderr def restore_stdout_and_stderr(stdout, stderr): sys.stdout = stdout sys.stderr = stderr def usage(exitcode=1, msg=""): """Prints usage info.""" if msg: sys.stderr.write("Error: %s.\n" % msg) sys.stderr.write("""\ Usage: %s [options] [start|stop|restart|monitor|status] The following commands are available for bibsched: start start bibsched in background stop stop running bibtasks and the bibsched daemon safely halt halt running bibsched while keeping bibtasks running restart restart running bibsched monitor enter the interactive monitor status get report about current status of the queue purge purge the scheduler queue from old tasks General options: -h, --help \t Print this help. -V, --version \t Print version information. 
-q, --quiet \t Quiet mode -d, --debug \t Write debugging information in bibsched.log Status options: -s, --status=LIST\t Which BibTask status should be considered (default is Running,waiting) -S, --since=TIME\t Since how long time to consider tasks e.g.: 30m, 2h, 1d (default is all) -t, --tasks=LIST\t Comma separated list of BibTask to consider (default \t is all) Purge options: -s, --status=LIST\t Which BibTask status should be considered (default is DONE) -S, --since=TIME\t Since how long time to consider tasks e.g.: 30m, 2h, 1d (default is %s days) -t, --tasks=LIST\t Comma separated list of BibTask to consider (default \t is %s) """ % (sys.argv[0], CFG_BIBSCHED_GC_TASKS_OLDER_THAN, ','.join(CFG_BIBSCHED_GC_TASKS_TO_REMOVE + CFG_BIBSCHED_GC_TASKS_TO_ARCHIVE))) sys.exit(exitcode) pidfile = os.path.join(CFG_PREFIX, 'var', 'run', 'bibsched.pid') def error(msg): print >> sys.stderr, "error: %s" % msg sys.exit(1) def warning(msg): print >> sys.stderr, "warning: %s" % msg def server_pid(ping_the_process=True, check_is_really_bibsched=True): # The pid must be stored on the filesystem try: pid = int(open(pidfile).read()) except IOError: return None if ping_the_process: # Even if the pid is available, we check if it corresponds to an # actual process, as it might have been killed externally try: os.kill(pid, signal.SIGCONT) except OSError: warning("pidfile %s found referring to pid %s which is not running" % (pidfile, pid)) return None if check_is_really_bibsched: output = run_shell_command("ps p %s -o args=", (str(pid), ))[1] if not 'bibsched' in output: warning("pidfile %s found referring to pid %s which does not correspond to bibsched: cmdline is %s" % (pidfile, pid, output)) return None return pid def start(verbose=True, debug=False): """ Fork this process in the background and start processing requests. The process PID is stored in a pid file, so that it can be stopped later on.""" if verbose: sys.stdout.write("starting bibsched: ") sys.stdout.flush() pid = server_pid(ping_the_process=False) if pid: pid2 = server_pid() if pid2: error("another instance of bibsched (pid %d) is running" % pid2) else: warning("%s exist but the corresponding bibsched (pid %s) seems not be running" % (pidfile, pid)) warning("erasing %s and continuing..." % (pidfile, )) os.remove(pidfile) # start the child process using the "double fork" technique pid = os.fork() if pid > 0: sys.exit(0) os.setsid() os.chdir('/') pid = os.fork() if pid > 0: if verbose: sys.stdout.write('pid %d\n' % pid) Log("daemon started (pid %d)" % pid) open(pidfile, 'w').write('%d' % pid) return sys.stdin.close() redirect_stdout_and_stderr() sched = BibSched(debug=debug) try: sched.watch_loop() finally: try: os.remove(pidfile) except OSError: pass def halt(verbose=True, soft=False, debug=False): # pylint: disable=W0613 pid = server_pid() if not pid: if soft: print >> sys.stderr, 'bibsched seems not to be running.' return else: error('bibsched seems not to be running.') try: os.kill(pid, signal.SIGKILL) except OSError: print >> sys.stderr, 'no bibsched process found' Log("daemon stopped (pid %d)" % pid) if verbose: print "stopping bibsched: pid %d" % pid os.unlink(pidfile) def monitor(verbose=True, debug=False): # pylint: disable=W0613 old_stdout, old_stderr = redirect_stdout_and_stderr() try: Manager(old_stdout) finally: restore_stdout_and_stderr(old_stdout, old_stderr) def write_message(msg, stream=None, verbose=1): # pylint: disable=W0613 """Write message and flush output stream (may be sys.stdout or sys.stderr). 
Useful for debugging stuff.""" if stream is None: stream = sys.stdout if msg: if stream == sys.stdout or stream == sys.stderr: stream.write(time.strftime("%Y-%m-%d %H:%M:%S --> ", time.localtime())) try: stream.write("%s\n" % msg) except UnicodeEncodeError: stream.write("%s\n" % msg.encode('ascii', 'backslashreplace')) stream.flush() else: sys.stderr.write("Unknown stream %s. [must be sys.stdout or sys.stderr]\n" % stream) def report_queue_status(verbose=True, status=None, since=None, tasks=None): # pylint: disable=W0613 """ Report about the current status of BibSched queue on standard output. """ def report_about_processes(status='RUNNING', since=None, tasks=None): """ Helper function to report about processes with the given status. """ if tasks is None: task_query = '' else: task_query = 'AND proc IN (%s)' % ( ','.join([repr(real_escape_string(task)) for task in tasks])) if since is None: since_query = '' else: # We're not interested in future task if since.startswith('+') or since.startswith('-'): since = since[1:] since = '-' + since since_query = "AND runtime >= '%s'" % get_datetime(since) res = run_sql("""SELECT id, proc, user, runtime, sleeptime, status, progress, priority FROM schTASK WHERE status=%%s %(task_query)s %(since_query)s ORDER BY id ASC""" % { 'task_query': task_query, 'since_query' : since_query}, (status,)) write_message("%s processes: %d" % (status, len(res))) for (proc_id, proc_proc, proc_user, proc_runtime, proc_sleeptime, proc_status, proc_progress, proc_priority) in res: write_message(' * ID="%s" PRIORITY="%s" PROC="%s" USER="%s" ' 'RUNTIME="%s" SLEEPTIME="%s" STATUS="%s" ' 'PROGRESS="%s"' % (proc_id, proc_priority, proc_proc, proc_user, proc_runtime, proc_sleeptime, proc_status, proc_progress)) return write_message("BibSched queue status report for %s:" % gethostname()) mode = server_pid() and "AUTOMATIC" or "MANUAL" write_message("BibSched queue running mode: %s" % mode) if status is None: report_about_processes('Running', since, tasks) report_about_processes('Waiting', since, tasks) else: for state in status: report_about_processes(state, since, tasks) write_message("Done.") def restart(verbose=True, debug=False): halt(verbose, soft=True, debug=debug) start(verbose, debug=debug) def stop(verbose=True, debug=False): """ * Stop bibsched * Send stop signal to all the running tasks * wait for all the tasks to stop * return """ if verbose: print "Stopping BibSched if running" halt(verbose, soft=True, debug=debug) run_sql("UPDATE schTASK SET status='WAITING' WHERE status='SCHEDULED'") res = run_sql("""SELECT id, proc, status FROM schTASK WHERE status NOT LIKE 'DONE' AND status NOT LIKE '%_DELETED' AND (status='RUNNING' OR status='ABOUT TO STOP' OR status='ABOUT TO SLEEP' OR status='SLEEPING' OR status='CONTINUING')""") if verbose: print "Stopping all running BibTasks" for task_id, proc, status in res: if status == 'SLEEPING': bibsched_send_signal(proc, task_id, signal.SIGCONT) time.sleep(CFG_BIBSCHED_REFRESHTIME) bibsched_set_status(task_id, 'ABOUT TO STOP') while run_sql("""SELECT id FROM schTASK WHERE status NOT LIKE 'DONE' AND status NOT LIKE '%_DELETED' AND (status='RUNNING' OR status='ABOUT TO STOP' OR status='ABOUT TO SLEEP' OR status='SLEEPING' OR status='CONTINUING')"""): if verbose: sys.stdout.write('.') sys.stdout.flush() time.sleep(CFG_BIBSCHED_REFRESHTIME) if verbose: print "\nStopped" Log("BibSched and all BibTasks stopped") def main(): from invenio.bibtask import check_running_process_user check_running_process_user() verbose = True status = None since 
= None tasks = None debug = False try: opts, args = getopt.gnu_getopt(sys.argv[1:], "hVdqS:s:t:", [ "help", "version", "debug", "quiet", "since=", "status=", "task="]) except getopt.GetoptError, err: Log("Error: %s" % err) usage(1, err) for opt, arg in opts: if opt in ["-h", "--help"]: usage(0) elif opt in ["-V", "--version"]: print __revision__ sys.exit(0) elif opt in ['-q', '--quiet']: verbose = False elif opt in ['-s', '--status']: status = arg.split(',') elif opt in ['-S', '--since']: since = arg elif opt in ['-t', '--task']: tasks = arg.split(',') elif opt in ['-d', '--debug']: debug = True else: usage(1) try: cmd = args[0] except IndexError: cmd = 'monitor' try: if cmd in ('status', 'purge'): {'status' : report_queue_status, 'purge' : gc_tasks}[cmd](verbose, status, since, tasks) else: {'start': start, 'halt': halt, 'stop': stop, 'restart': restart, 'monitor': monitor}[cmd](verbose=verbose, debug=debug) except KeyError: usage(1, 'unkown command: %s' % cmd) if __name__ == '__main__': main() diff --git a/modules/miscutil/lib/web_api_key_unit_tests.py b/modules/miscutil/lib/web_api_key_regression_tests.py similarity index 97% copy from modules/miscutil/lib/web_api_key_unit_tests.py copy to modules/miscutil/lib/web_api_key_regression_tests.py index 5b1944ddb..72fa2b097 100644 --- a/modules/miscutil/lib/web_api_key_unit_tests.py +++ b/modules/miscutil/lib/web_api_key_regression_tests.py @@ -1,120 +1,120 @@ # -*- coding: utf-8 -*- ## ## This file is part of Invenio. ## Copyright (C) 2006, 2007, 2008, 2010, 2011 CERN. ## ## Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. 
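# Scheme exercised by the helper below: the client appends an 'apikey'
# parameter (plus a 'timestamp' when a secret key is used), sorts the
# parameters case-insensitively, and signs "path?query" with HMAC-SHA1,
# sending the hexadecimal digest as the 'signature' parameter.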
from invenio import web_api_key """Unit tests for REST like authentication API.""" try: import hashlib except: pass import unittest import re import hmac import urllib import time import string from invenio.testutils import make_test_suite, run_test_suite from invenio.dbquery import run_sql web_api_key.CFG_WEB_API_KEY_ALLOWED_URL = [('/search\?*', 0, True), ('/bad\?*', -1, True)] #Just for testing web_api_key._CFG_WEB_API_KEY_ALLOWED_URL = [(re.compile(_url), _authorized_time, _need_timestamp) for _url, _authorized_time, _need_timestamp in web_api_key.CFG_WEB_API_KEY_ALLOWED_URL] def build_web_request(path, params, api_key=None, secret_key=None): items = (hasattr(params, 'items') and [params.items()] or [list(params)])[0] if api_key: items.append(('apikey', api_key)) if secret_key: items.append(('timestamp', str(int(time.time())))) items = sorted(items, key=lambda x: x[0].lower()) url = '%s?%s' % (path, urllib.urlencode(items)) signature = hmac.new(secret_key, url, hashlib.sha1).hexdigest() items.append(('signature', signature)) if not items: return path return '%s?%s' % (path, urllib.urlencode(items)) class APIKeyTest(unittest.TestCase): """ Test functions related to the REST authentication API """ def setUp(self): self.id_admin = run_sql('SELECT id FROM user WHERE nickname="admin"')[0][0] def test_create_remove_show_key(self): """apikey - create/list/delete REST key""" self.assertEqual(0, len(web_api_key.show_web_api_keys(uid=self.id_admin))) web_api_key.create_new_web_api_key(self.id_admin, "Test key I") web_api_key.create_new_web_api_key(self.id_admin, "Test key II") web_api_key.create_new_web_api_key(self.id_admin, "Test key III") web_api_key.create_new_web_api_key(self.id_admin, "Test key IV") web_api_key.create_new_web_api_key(self.id_admin, "Test key V") self.assertEqual(5, len(web_api_key.show_web_api_keys(uid=self.id_admin))) self.assertEqual(5, len(web_api_key.show_web_api_keys(uid=self.id_admin, diff_status=''))) keys_info = web_api_key.show_web_api_keys(uid=self.id_admin) web_api_key.mark_web_api_key_as_removed(keys_info[0][0]) self.assertEqual(4, len(web_api_key.show_web_api_keys(uid=self.id_admin))) - self.assertEqual(5, len(web_api_key.show_web_api_keys(uid=self.id_admin,diff_status=''))) + self.assertEqual(5, len(web_api_key.show_web_api_keys(uid=self.id_admin, diff_status=''))) run_sql("UPDATE webapikey SET status='WARNING' WHERE id=%s", (keys_info[1][0],)) run_sql("UPDATE webapikey SET status='REVOKED' WHERE id=%s", (keys_info[2][0],)) self.assertEqual(4, len(web_api_key.show_web_api_keys(uid=self.id_admin))) self.assertEqual(5, len(web_api_key.show_web_api_keys(uid=self.id_admin, diff_status=''))) run_sql("DELETE FROM webapikey") def test_acc_get_uid_from_request(self): """webapikey - Login user from request using REST key""" path = '/search' params = 'ln=es&sc=1&c=Articles & Preprints&action_search=Buscar&p=ellis' self.assertEqual(0, len(web_api_key.show_web_api_keys(uid=self.id_admin))) web_api_key.create_new_web_api_key(self.id_admin, "Test key I") key_info = run_sql("SELECT id FROM webapikey WHERE id_user=%s", (self.id_admin,)) url = web_api_key.build_web_request(path, params, api_key=key_info[0][0]) url = string.split(url, '?') uid = web_api_key.acc_get_uid_from_request(url[0], url[1]) self.assertEqual(uid, self.id_admin) url = web_api_key.build_web_request(path, params, api_key=key_info[0][0]) url += "123" # corrupt the key url = string.split(url, '?') uid = web_api_key.acc_get_uid_from_request(url[0], url[1]) self.assertEqual(uid, -1) path = '/bad' uid = 
web_api_key.acc_get_uid_from_request(path, "") self.assertEqual(uid, -1) - params = { 'nocache': 'yes', 'limit': 123 } + params = {'nocache': 'yes', 'limit': 123} url = web_api_key.build_web_request(path, params, api_key=key_info[0][0]) url = string.split(url, '?') uid = web_api_key.acc_get_uid_from_request(url[0], url[1]) self.assertEqual(uid, -1) run_sql("DELETE FROM webapikey") TEST_SUITE = make_test_suite(APIKeyTest) if __name__ == "__main__": run_test_suite(TEST_SUITE) - run_sql("DELETE FROM webapikey") \ No newline at end of file + run_sql("DELETE FROM webapikey") diff --git a/modules/miscutil/lib/web_api_key_unit_tests.py b/modules/miscutil/lib/web_api_key_unit_tests.py index 5b1944ddb..beaf633fb 100644 --- a/modules/miscutil/lib/web_api_key_unit_tests.py +++ b/modules/miscutil/lib/web_api_key_unit_tests.py @@ -1,120 +1,32 @@ # -*- coding: utf-8 -*- ## ## This file is part of Invenio. -## Copyright (C) 2006, 2007, 2008, 2010, 2011 CERN. +## Copyright (C) 2006, 2007, 2008, 2010, 2011, 2013 CERN. ## ## Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. -from invenio import web_api_key """Unit tests for REST like authentication API.""" -try: - import hashlib -except: - pass -import unittest -import re -import hmac -import urllib -import time -import string +# Note: tests moved to regression tests. Keeping this file here with +# empty test case set in order to overwrite any previously installed +# file. Also, keeping TEST_SUITE empty so that `inveniocfg +# --run-unit-tests' would not complain. 
from invenio.testutils import make_test_suite, run_test_suite -from invenio.dbquery import run_sql -web_api_key.CFG_WEB_API_KEY_ALLOWED_URL = [('/search\?*', 0, True), - ('/bad\?*', -1, True)] #Just for testing - -web_api_key._CFG_WEB_API_KEY_ALLOWED_URL = [(re.compile(_url), _authorized_time, _need_timestamp) - for _url, _authorized_time, _need_timestamp in web_api_key.CFG_WEB_API_KEY_ALLOWED_URL] - -def build_web_request(path, params, api_key=None, secret_key=None): - items = (hasattr(params, 'items') and [params.items()] or [list(params)])[0] - if api_key: - items.append(('apikey', api_key)) - if secret_key: - items.append(('timestamp', str(int(time.time())))) - items = sorted(items, key=lambda x: x[0].lower()) - url = '%s?%s' % (path, urllib.urlencode(items)) - signature = hmac.new(secret_key, url, hashlib.sha1).hexdigest() - items.append(('signature', signature)) - if not items: - return path - return '%s?%s' % (path, urllib.urlencode(items)) - -class APIKeyTest(unittest.TestCase): - """ Test functions related to the REST authentication API """ - def setUp(self): - self.id_admin = run_sql('SELECT id FROM user WHERE nickname="admin"')[0][0] - - def test_create_remove_show_key(self): - """apikey - create/list/delete REST key""" - self.assertEqual(0, len(web_api_key.show_web_api_keys(uid=self.id_admin))) - web_api_key.create_new_web_api_key(self.id_admin, "Test key I") - web_api_key.create_new_web_api_key(self.id_admin, "Test key II") - web_api_key.create_new_web_api_key(self.id_admin, "Test key III") - web_api_key.create_new_web_api_key(self.id_admin, "Test key IV") - web_api_key.create_new_web_api_key(self.id_admin, "Test key V") - self.assertEqual(5, len(web_api_key.show_web_api_keys(uid=self.id_admin))) - self.assertEqual(5, len(web_api_key.show_web_api_keys(uid=self.id_admin, diff_status=''))) - keys_info = web_api_key.show_web_api_keys(uid=self.id_admin) - web_api_key.mark_web_api_key_as_removed(keys_info[0][0]) - self.assertEqual(4, len(web_api_key.show_web_api_keys(uid=self.id_admin))) - self.assertEqual(5, len(web_api_key.show_web_api_keys(uid=self.id_admin,diff_status=''))) - - run_sql("UPDATE webapikey SET status='WARNING' WHERE id=%s", (keys_info[1][0],)) - run_sql("UPDATE webapikey SET status='REVOKED' WHERE id=%s", (keys_info[2][0],)) - - self.assertEqual(4, len(web_api_key.show_web_api_keys(uid=self.id_admin))) - self.assertEqual(5, len(web_api_key.show_web_api_keys(uid=self.id_admin, diff_status=''))) - - run_sql("DELETE FROM webapikey") - - def test_acc_get_uid_from_request(self): - """webapikey - Login user from request using REST key""" - path = '/search' - params = 'ln=es&sc=1&c=Articles & Preprints&action_search=Buscar&p=ellis' - - self.assertEqual(0, len(web_api_key.show_web_api_keys(uid=self.id_admin))) - web_api_key.create_new_web_api_key(self.id_admin, "Test key I") - - key_info = run_sql("SELECT id FROM webapikey WHERE id_user=%s", (self.id_admin,)) - url = web_api_key.build_web_request(path, params, api_key=key_info[0][0]) - url = string.split(url, '?') - uid = web_api_key.acc_get_uid_from_request(url[0], url[1]) - self.assertEqual(uid, self.id_admin) - - url = web_api_key.build_web_request(path, params, api_key=key_info[0][0]) - url += "123" # corrupt the key - url = string.split(url, '?') - uid = web_api_key.acc_get_uid_from_request(url[0], url[1]) - self.assertEqual(uid, -1) - - path = '/bad' - uid = web_api_key.acc_get_uid_from_request(path, "") - self.assertEqual(uid, -1) - params = { 'nocache': 'yes', 'limit': 123 } - url = 
web_api_key.build_web_request(path, params, api_key=key_info[0][0]) - url = string.split(url, '?') - uid = web_api_key.acc_get_uid_from_request(url[0], url[1]) - self.assertEqual(uid, -1) - - run_sql("DELETE FROM webapikey") - -TEST_SUITE = make_test_suite(APIKeyTest) +TEST_SUITE = make_test_suite() if __name__ == "__main__": run_test_suite(TEST_SUITE) - run_sql("DELETE FROM webapikey") \ No newline at end of file diff --git a/modules/webaccess/lib/external_authentication_robot.py b/modules/webaccess/lib/external_authentication_robot.py index 2374c3d9f..dbdd2fd4c 100644 --- a/modules/webaccess/lib/external_authentication_robot.py +++ b/modules/webaccess/lib/external_authentication_robot.py @@ -1,412 +1,412 @@ # -*- coding: utf-8 -*- ## ## This file is part of Invenio. ## Copyright (C) 2010, 2011, 2013 CERN. ## ## Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. """External user authentication for simple robots This implement an external authentication system suitable for robots usage. User attributes are retrieved directly from the form dictionary of the request object. """ import os import sys import hmac import time import base64 if sys.hexversion < 0x2050000: import sha as sha1 else: from hashlib import sha1 from cPickle import dumps from zlib import decompress, compress from invenio.jsonutils import json, json_unicode_to_utf8 from invenio.shellutils import mymkdir from invenio.external_authentication import ExternalAuth, InvenioWebAccessExternalAuthError from invenio.config import CFG_ETCDIR, CFG_SITE_URL, CFG_SITE_SECURE_URL CFG_ROBOT_EMAIL_ATTRIBUTE_NAME = 'email' CFG_ROBOT_NICKNAME_ATTRIBUTE_NAME = 'nickname' CFG_ROBOT_GROUPS_ATTRIBUTE_NAME = 'groups' CFG_ROBOT_TIMEOUT_ATTRIBUTE_NAME = '__timeout__' CFG_ROBOT_USERIP_ATTRIBUTE_NAME = '__userip__' CFG_ROBOT_GROUPS_SEPARATOR = ';' CFG_ROBOT_URL_TIMEOUT = 3600 CFG_ROBOT_KEYS_PATH = os.path.join(CFG_ETCDIR, 'webaccess', 'robot_keys.dat') def normalize_ip(ip, up_to_bytes=4): """ @param up_to_bytes: set this to the number of bytes that should be considered in the normalization. E.g. is this is set two 2, only the first two bytes will be considered, while the remaining two will be set to 0. @return: a normalized IP, e.g. 123.02.12.12 -> 123.2.12.12 """ try: ret = [] for i, number in enumerate(ip.split(".")): if i < up_to_bytes: ret.append(str(int(number))) else: ret.append("0") return '.'.join(ret) except ValueError: ## e.g. if it's IPV6 ::1 return ip def load_robot_keys(): """ @return: the robot key dictionary. 
""" from cPickle import loads from zlib import decompress try: robot_keys = loads(decompress(open(CFG_ROBOT_KEYS_PATH).read())) if not isinstance(robot_keys, dict): return {} else: return robot_keys except: return {} class ExternalAuthRobot(ExternalAuth): """ This class implement an external authentication method suitable to be used by an external service that, after having authenticated a user, will provide a URL to the user that, once followed, will successfully login the user into Invenio, with any detail the external service decided to provide to the Invenio installation. Such URL should be built as follows: BASE?QUERY where BASE is CFG_SITE_SECURE_URL/youraccount/robotlogin and QUERY is a urlencoded mapping of the following key->values: - assertion: an assertion, i.e. a piece of information describing the user, see below for more details. - robot: the identifier of the external service providing the assertion - login_method: the name of the login method as defined in CFG_EXTERNAL_AUTHENTICATION. - digest: the digest of the signature as detailed below. - referer: the URL where the user should be redirected after successful login (it is called referer as, for historical reasons, this is the original URL of the page on which, a human-user has clicked "login". the "assertion" should be a JSON serialized mapping with the following keys: - email: the email of the user (i.e. its identifier). - nickname: optional nickname of the user. - groups: an optional ';'-separated list of groups to which the user belongs to. - __timeout__: the number of seconds (floating point) from the Epoch, after which the URL will no longer be valid. (expressed in UTC) - __userip__: the IP address of the user for whom this URL has been created. (if the user will follow this URL using a different URL the request will not be valid) - any other key can be added and will be merged in the external user settings. If L{use_zlib} is True the assertion is a base64-url-flavour encoding of the zlib compression of the original assertion (useful for shortening the URL while make it easy to type). The "digest" is the hexadecimal representation of the digest using the HMAC-SHA1 method to sign the assertion with the secret key associated with the robot for the given login_method. @param enforce_external_nicknames: whether to trust nicknames provided by the external service and use them (if possible) as unique identifier in the system. @type enforce_external_nicknames: boolean @param email_attribute_name: the actual key in the assertion that will contain the email. @type email_attribute_name: string @param nickname_attribute_name: the actual key in the assertion that will contain the nickname. @type nickname_attribute_name: string @param groups_attribute_name: the actual key in the assertion that will contain the groups. @type groups_attribute_name: string @param groups_separator: the string used to separate groups. @type groups_separator: string @param timeout_attribute_name: the actual key in the assertion that will contain the timeout. @type timeout_attribute_name: string @param userip_attribute_name: the actual key in the assertion that will contain the user IP. @type userip_attribute_name: string @param external_id_attribute_name: the actual string that identifies the user in the external authentication system. By default this is set to be the same as the nickname, but this can be configured. 
@param check_user_ip: whether to check for the IP address of the user using the given URL, against the IP address stored in the assertion to be identical. If 0, no IP check will be performed, if 1, only the 1st byte will be compared, if 2, only the first two bytes will be compared, if 3, only the first three bytes, and if 4, the whole IP address will be checked. @type check_user_ip: int @param use_zlib: whether to use base64-url-flavour encoding of the zlib compression of the json serialization of the assertion or simply the json serialization of the assertion. @type use_zlib: boolean """ def __init__(self, enforce_external_nicknames=False, email_attribute_name=CFG_ROBOT_EMAIL_ATTRIBUTE_NAME, nickname_attribute_name=CFG_ROBOT_NICKNAME_ATTRIBUTE_NAME, groups_attribute_name=CFG_ROBOT_GROUPS_ATTRIBUTE_NAME, groups_separator=CFG_ROBOT_GROUPS_SEPARATOR, timeout_attribute_name=CFG_ROBOT_TIMEOUT_ATTRIBUTE_NAME, userip_attribute_name=CFG_ROBOT_USERIP_ATTRIBUTE_NAME, check_user_ip=4, external_id_attribute_name=CFG_ROBOT_NICKNAME_ATTRIBUTE_NAME, use_zlib=True, ): ExternalAuth.__init__(self, enforce_external_nicknames=enforce_external_nicknames) self.email_attribute_name = email_attribute_name self.nickname_attribute_name = nickname_attribute_name self.groups_attribute_name = groups_attribute_name self.groups_separator = groups_separator self.timeout_attribute_name = timeout_attribute_name self.userip_attribute_name = userip_attribute_name self.external_id_attribute_name = external_id_attribute_name self.check_user_ip = check_user_ip self.use_zlib = use_zlib def __extract_attribute(self, req): """ Load from the request the given assertion, extract all the attribute to properly login the user, and verify that the data are actually both well formed and signed correctly. """ from invenio.webinterface_handler import wash_urlargd args = wash_urlargd(req.form, { 'assertion': (str, ''), 'robot': (str, ''), 'digest': (str, ''), 'login_method': (str, '')}) assertion = args['assertion'] digest = args['digest'] robot = args['robot'] login_method = args['login_method'] shared_key = load_robot_keys().get(login_method, {}).get(robot) if shared_key is None: raise InvenioWebAccessExternalAuthError("A key does not exist for robot: %s, login_method: %s" % (robot, login_method)) if not self.verify(shared_key, assertion, digest): raise InvenioWebAccessExternalAuthError("The provided assertion does not validate against the digest %s for robot %s" % (repr(digest), repr(robot))) if self.use_zlib: try: ## Workaround to Perl implementation that does not add ## any padding to the base64 encoding. 
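                ## For example, an assertion of length 10 needs two '='
                ## characters of padding ((4 - 10 % 4) % 4 == 2) before
                ## urlsafe_b64decode will accept it.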
needed_pad = (4 - len(assertion) % 4) % 4 assertion += needed_pad * '=' assertion = decompress(base64.urlsafe_b64decode(assertion)) except: raise InvenioWebAccessExternalAuthError("The provided assertion is corrupted") data = json_unicode_to_utf8(json.loads(assertion)) if not isinstance(data, dict): raise InvenioWebAccessExternalAuthError("The provided assertion is invalid") timeout = data[self.timeout_attribute_name] if timeout < time.time(): raise InvenioWebAccessExternalAuthError("The provided assertion is expired") userip = data.get(self.userip_attribute_name) if not self.check_user_ip or (normalize_ip(userip, self.check_user_ip) == normalize_ip(req.remote_ip, self.check_user_ip)): return data else: raise InvenioWebAccessExternalAuthError("The provided assertion has been issued for a different IP address (%s instead of %s)" % (userip, req.remote_ip)) def auth_user(self, username, password, req=None): """Authenticate user-supplied USERNAME and PASSWORD. Return None if authentication failed, or the email address of the person if the authentication was successful. In order to do this you may perhaps have to keep a translation table between usernames and email addresses. Raise InvenioWebAccessExternalAuthError in case of external troubles. """ data = self.__extract_attribute(req) email = data.get(self.email_attribute_name) ext_id = data.get(self.external_id_attribute_name, email) if email: if isinstance(email, str): - return email.strip().lower(), ext_id.strip() + return email.strip().lower(), str(ext_id).strip() else: raise InvenioWebAccessExternalAuthError("The email provided in the assertion is invalid: %s" % (repr(email))) else: return None, None def fetch_user_groups_membership(self, username, password=None, req=None): """Given a username and a password, returns a dictionary of groups and their description to which the user is subscribed. Raise InvenioWebAccessExternalAuthError in case of troubles. """ if self.groups_attribute_name: data = self.__extract_attribute(req) groups = data.get(self.groups_attribute_name) if groups: if isinstance(groups, str): groups = [group.strip() for group in groups.split(self.groups_separator)] return dict(zip(groups, groups)) else: raise InvenioWebAccessExternalAuthError("The groups provided in the assertion are invalid: %s" % (repr(groups))) return {} def fetch_user_nickname(self, username, password=None, req=None): """Given a username and a password, returns the right nickname belonging to that user (username could be an email). """ if self.nickname_attribute_name: data = self.__extract_attribute(req) nickname = data.get(self.nickname_attribute_name) if nickname: if isinstance(nickname, str): return nickname.strip().lower() else: raise InvenioWebAccessExternalAuthError("The nickname provided in the assertion is invalid: %s" % (repr(nickname))) return None def fetch_user_preferences(self, username, password=None, req=None): """Given a username and a password, returns a dictionary of keys and values, corresponding to external infos and settings. userprefs = {"telephone": "2392489", "address": "10th Downing Street"} (WEBUSER WILL erase all prefs that starts by EXTERNAL_ and will store: "EXTERNAL_telephone"; all internal preferences can use whatever name but starting with EXTERNAL). If a pref begins with HIDDEN_ it will be ignored. 
""" data = self.__extract_attribute(req) for key in (self.email_attribute_name, self.groups_attribute_name, self.nickname_attribute_name, self.timeout_attribute_name, self.userip_attribute_name): if key and key in data: del data[key] return data def robot_login_method_p(): """Return True if this method is dedicated to robots and should not therefore be available as a choice to regular users upon login. """ return True robot_login_method_p = staticmethod(robot_login_method_p) def sign(secret, assertion): """ @return: a signature of the given assertion. @rtype: string @note: override this method if you want to change the signature algorithm (e.g. to use GPG). @see: L{verify} """ return hmac.new(secret, assertion, sha1).hexdigest() sign = staticmethod(sign) def verify(secret, assertion, signature): """ @return: True if the signature is valid @rtype: boolean @note: override this method if you want to change the signature algorithm (e.g. to use GPG) @see: L{sign} """ return hmac.new(secret, assertion, sha1).hexdigest() == signature verify = staticmethod(verify) def test_create_example_url(self, email, login_method, robot, ip, assertion=None, timeout=None, referer=None, groups=None, nickname=None): """ Create a test URL to test the robot login. @param email: email of the user we want to login as. @type email: string @param login_method: the login_method name as specified in CFG_EXTERNAL_AUTHENTICATION. @type login_method: string @param robot: the identifier of this robot. @type robot: string @param assertion: any further data we want to send to. @type: json serializable mapping @param ip: the IP of the user. @type: string @param timeout: timeout when the URL will expire (in seconds from the Epoch) @type timeout: float @param referer: the URL where to land after successful login. @type referer: string @param groups: the list of optional group of the user. @type groups: list of string @param nickname: the optional nickname of the user. @type nickname: string @return: the URL to login as the user. @rtype: string """ from invenio.access_control_config import CFG_EXTERNAL_AUTHENTICATION from invenio.urlutils import create_url if assertion is None: assertion = {} assertion[self.email_attribute_name] = email if nickname: assertion[self.nickname_attribute_name] = nickname if groups: assertion[self.groups_attribute_name] = self.groups_separator.join(groups) if timeout is None: timeout = time.time() + CFG_ROBOT_URL_TIMEOUT assertion[self.timeout_attribute_name] = timeout if referer is None: referer = CFG_SITE_URL if login_method is None: for a_login_method, details in CFG_EXTERNAL_AUTHENTICATION.iteritems(): if details[2]: login_method = a_login_method break robot_keys = load_robot_keys() assertion[self.userip_attribute_name] = ip assertion = json.dumps(assertion) if self.use_zlib: assertion = base64.urlsafe_b64encode(compress(assertion)) shared_key = robot_keys[login_method][robot] digest = self.sign(shared_key, assertion) return create_url("%s%s" % (CFG_SITE_SECURE_URL, "/youraccount/robotlogin"), { 'assertion': assertion, 'robot': robot, 'login_method': login_method, 'digest': digest, 'referer': referer}) def update_robot_key(login_method, robot, key=None): """ Utility to update the robot key store. @param login_method: the login_method name as per L{CFG_EXTERNAL_AUTHENTICATION}. It should correspond to a robot-enable login method. 
@type: string @param robot: the robot identifier @type robot: string @param key: the secret @type key: string @note: if the secret is empty the corresponding key will be removed. """ robot_keys = load_robot_keys() if key is None and login_method in robot_keys and robot in robot_keys[login_method]: del robot_keys[login_method][robot] if not robot_keys[login_method]: del robot_keys[login_method] else: if login_method not in robot_keys: robot_keys[login_method] = {} robot_keys[login_method][robot] = key mymkdir(os.path.join(CFG_ETCDIR, 'webaccess')) open(CFG_ROBOT_KEYS_PATH, 'w').write(compress(dumps(robot_keys, -1))) diff --git a/modules/websearch/lib/search_engine_query_parser.py b/modules/websearch/lib/search_engine_query_parser.py index f983f045f..4cc7bd88c 100644 --- a/modules/websearch/lib/search_engine_query_parser.py +++ b/modules/websearch/lib/search_engine_query_parser.py @@ -1,1251 +1,1338 @@ # -*- coding: utf-8 -*- ## This file is part of Invenio. -## Copyright (C) 2008, 2010, 2011, 2012 CERN. +## Copyright (C) 2008, 2010, 2011, 2012, 2013 CERN. ## ## Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. # pylint: disable=C0301 """Invenio Search Engine query parsers.""" import re import string from datetime import datetime try: import dateutil if not hasattr(dateutil, '__version__') or dateutil.__version__ != '2.0': from dateutil import parser as du_parser from dateutil.relativedelta import relativedelta as du_delta + from dateutil import relativedelta GOT_DATEUTIL = True else: from warnings import warn warn("Not using dateutil module because the version %s is not compatible with Python-2.x" % dateutil.__version__) GOT_DATEUTIL = False except ImportError: # Ok, no date parsing is possible, but continue anyway, # since this package is only recommended, not mandatory. GOT_DATEUTIL = False from invenio.bibindex_engine_tokenizer import BibIndexFuzzyNameTokenizer as FNT from invenio.logicutils import to_cnf from invenio.config import CFG_WEBSEARCH_SPIRES_SYNTAX NameScanner = FNT() class InvenioWebSearchMismatchedParensError(Exception): """Exception for parse errors caused by mismatched parentheses.""" def __init__(self, message): """Initialization.""" self.message = message def __str__(self): """String representation.""" return repr(self.message) class SearchQueryParenthesisedParser(object): """Search query parser that handles arbitrarily-nested parentheses Parameters: * substitution_dict: a dictionary mapping strings to other strings. By default, maps 'and', 'or' and 'not' to '+', '|', and '-'. Dictionary values will be treated as valid operators for output. A note (valkyrie 25.03.2011): Based on looking through the prod search logs, it is evident that users, when they are using parentheses to do searches, only run word characters up against parens when they intend the parens to be part of the word (e.g. 
U(1)), and when they are using parentheses to combine operators, they put a space before and after them. As of writing, this is the behavior that SQPP now expects, in order that it be able to handle such queries as e(+)e(-) that contain operators in parentheses that should be interpreted as words. """ def __init__(self, substitution_dict = {'and': '+', 'or': '|', 'not': '-'}): self.substitution_dict = substitution_dict self.specials = set(['(', ')', '+', '|', '-', '+ -']) self.__tl_idx = 0 self.__tl_len = 0 # I think my names are both concise and clear # pylint: disable=C0103 def _invenio_to_python_logical(self, q): """Translate the + and - in invenio query strings into & and ~.""" p = q p = re.sub('\+ -', '&~', p) p = re.sub('\+', '&', p) p = re.sub('-', '~', p) p = re.sub(' ~', ' & ~', p) return p def _python_logical_to_invenio(self, q): """Translate the & and ~ in logical expression strings into + and -.""" p = q p = re.sub('\& ~', '-', p) p = re.sub('~', '-', p) p = re.sub('\&', '+', p) return p # pylint: enable=C0103 def parse_query(self, query): """Make query into something suitable for search_engine. This is the main entry point of the class. Given an expression of the form: "expr1 or expr2 (expr3 not (expr4 or expr5))" produces annoted list output suitable for consumption by search_engine, of the form: ['+', 'expr1', '|', 'expr2', '+', 'expr3 - expr4 | expr5'] parse_query() is a wrapper for self.tokenize() and self.parse(). """ toklist = self.tokenize(query) depth, balanced, dummy_d0_p = self.nesting_depth_and_balance(toklist) if not balanced: raise SyntaxError("Mismatched parentheses in "+str(toklist)) toklist, var_subs = self.substitute_variables(toklist) if depth > 1: toklist = self.tokenize(self.logically_reduce(toklist)) return self.parse(toklist, var_subs) def substitute_variables(self, toklist): """Given a token list, return a copy of token list in which all free variables are bound with boolean variable names of the form 'pN'. Additionally, all the substitutable logical operators are exchanged for their symbolic form and implicit ands are made explicit e.g., ((author:'ellis, j' and title:quark) or author:stevens jones) becomes: ((p0 + p1) | p2 + p3) with the substitution table: {'p0': "author:'ellis, j'", 'p1': "title:quark", 'p2': "author:stevens", 'p3': "jones" } Return value is the substituted token list and a copy of the substitution table. 
""" def labels(): i = 0 while True: yield 'p'+str(i) i += 1 def filter_front_ands(toklist): """Filter out extra logical connectives and whitespace from the front.""" while toklist[0] == '+' or toklist[0] == '|' or toklist[0] == '': toklist = toklist[1:] return toklist var_subs = {} labeler = labels() new_toklist = [''] cannot_be_anded = self.specials.difference((')',)) for token in toklist: token = token.lower() if token in self.substitution_dict: if token == 'not' and new_toklist[-1] == '+': new_toklist[-1] = '-' else: new_toklist.append(self.substitution_dict[token]) elif token == '(': if new_toklist[-1] not in self.specials: new_toklist.append('+') new_toklist.append(token) elif token not in self.specials: # apparently generators are hard for pylint to figure out # Turns off msg about labeler not having a 'next' method # pylint: disable=E1101 label = labeler.next() # pylint: enable=E1101 var_subs[label] = token if new_toklist[-1] not in cannot_be_anded: new_toklist.append('+') new_toklist.append(label) else: if token == '-' and new_toklist[-1] == '+': new_toklist[-1] = '-' else: new_toklist.append(token) return filter_front_ands(new_toklist), var_subs def nesting_depth_and_balance(self, token_list): """Checks that parentheses are balanced and counts how deep they nest""" depth = 0 maxdepth = 0 depth0_pairs = 0 good_depth = True for i in range(len(token_list)): token = token_list[i] if token == '(': if depth == 0: depth0_pairs += 1 depth += 1 if depth > maxdepth: maxdepth += 1 elif token == ')': depth -= 1 if depth == -1: # can only happen with unmatched ) good_depth = False # so force depth check to fail depth = 0 # but keep maxdepth in good range return maxdepth, depth == 0 and good_depth, depth0_pairs def logically_reduce(self, token_list): """Return token_list in conjunctive normal form as a string. CNF has the property that there will only ever be one level of parenthetical nesting, and all distributable operators (such as the not in -(p | q) will be fully distributed (as -p + -q). """ maxdepth, dummy_balanced, d0_p = self.nesting_depth_and_balance(token_list) s = ' '.join(token_list) s = self._invenio_to_python_logical(s) last_maxdepth = 0 while maxdepth != last_maxdepth: # XXX: sometimes NaryExpr doesn't try: # fully flatten Expr; but it usually s = str(to_cnf(s)) # does in 2 passes FIXME: diagnose except SyntaxError: raise SyntaxError(str(s)+" couldn't be converted to a logic expression.") last_maxdepth = maxdepth maxdepth, dummy_balanced, d0_p = self.nesting_depth_and_balance(self.tokenize(s)) if d0_p == 1 and s[0] == '(' and s[-1] == ')': # s can come back with extra parens s = s[1:-1] s = self._python_logical_to_invenio(s) return s def tokenize(self, query): """Given a query string, return a list of tokens from that string. * Isolates meaningful punctuation: ( ) + | - * Keeps single- and double-quoted strings together without interpretation. * Splits everything else on whitespace. i.e.: "expr1|expr2 (expr3-(expr4 or expr5))" becomes: ['expr1', '|', 'expr2', '(', 'expr3', '-', '(', 'expr4', 'or', 'expr5', ')', ')'] special case: "e(+)e(-)" interprets '+' and '-' as word characters since they are in parens with word characters run up against them. it becomes: ['e(+)e(-)'] """ ### # Invariants: # * Query is never modified # * In every loop iteration, querytokens grows to the right # * The only return point is at the bottom of the function, and the only # return value is querytokens ### def get_tokens(s): """ Given string s, return a list of s's tokens. 
Adds space around special punctuation, then splits on whitespace. """ s = ' '+s s = s.replace('->', '####DATE###RANGE##OP#') # XXX: Save '->' s = re.sub('(?P[a-zA-Z0-9_,=:]+)\((?P[a-zA-Z0-9_,+-/]*)\)', '#####\g####PAREN###\g##PAREN#', s) # XXX: Save U(1) and SL(2,Z) s = re.sub('####PAREN###(?P[.0-9/-]*)(?P[+])(?P[.0-9/-]*)##PAREN#', '####PAREN###\g##PLUS##\g##PAREN#', s) s = re.sub('####PAREN###(?P([.0-9/]|##PLUS##)*)(?P[-])' +\ '(?P([.0-9/]|##PLUS##)*)##PAREN#', '####PAREN###\g##MINUS##\g##PAREN#', s) # XXX: Save e(+)e(-) for char in self.specials: if char == '-': s = s.replace(' -', ' - ') s = s.replace(')-', ') - ') s = s.replace('-(', ' - (') else: s = s.replace(char, ' '+char+' ') s = re.sub('##PLUS##', '+', s) s = re.sub('##MINUS##', '-', s) # XXX: Restore e(+)e(-) s = re.sub('#####(?P[a-zA-Z0-9_,=:]+)####PAREN###(?P[a-zA-Z0-9_,+-/]*)##PAREN#', '\g(\g)', s) # XXX: Restore U(1) and SL(2,Z) s = s.replace('####DATE###RANGE##OP#', '->') # XXX: Restore '->' return s.split() querytokens = [] current_position = 0 re_quotes_match = re.compile(r'(?![\\])(".*?[^\\]")' + r"|(?![\\])('.*?[^\\]')") for match in re_quotes_match.finditer(query): match_start = match.start() quoted_region = match.group(0).strip() # clean the content after the previous quotes and before current quotes unquoted = query[current_position : match_start] querytokens.extend(get_tokens(unquoted)) # XXX: In case we end up with e.g. title:, "compton scattering", make it # title:"compton scattering" if querytokens and querytokens[0] and querytokens[-1][-1] == ':': querytokens[-1] += quoted_region # XXX: In case we end up with e.g. "expr1",->,"expr2", make it # "expr1"->"expr2" elif len(querytokens) >= 2 and querytokens[-1] == '->': arrow = querytokens.pop() querytokens[-1] += arrow + quoted_region else: # add our newly tokenized content to the token list querytokens.extend([quoted_region]) # move current position to the end of the tokenized content current_position = match.end() # get tokens from the last appearance of quotes until the query end unquoted = query[current_position : len(query)] querytokens.extend(get_tokens(unquoted)) return querytokens def parse(self, token_list, variable_substitution_dict=None): """Make token_list consumable by search_engine. Turns a list of tokens and a variable mapping into a grouped list of subexpressions in the format suitable for use by search_engine, e.g.: ['+', 'searchterm', '-', 'searchterm to exclude', '|', 'another term'] Incidentally, this works recursively so parens can cause arbitrarily deep nestings. But since the search_engine doesn't know about nested structures, we need to flatten the input structure first. """ ### # Invariants: # * Token list is never modified # * Balanced parens remain balanced; unbalanced parens are an error # * Individual tokens may only be exchanged for items in the variable # substitution dict; otherwise they pass through unmolested # * Return value is built up mostly as a stack ### op_symbols = self.substitution_dict.values() self.__tl_idx = 0 self.__tl_len = len(token_list) def inner_parse(token_list, open_parens=False): ''' although it's not in the API, it seems sensible to comment this function a bit. dist_token here is a token (e.g. 
a second-order operator) which needs to be distributed across other tokens inside the inner parens ''' if open_parens: parsed_values = [] else: parsed_values = ['+'] i = 0 while i < len(token_list): token = token_list[i] if i > 0 and parsed_values[-1] not in op_symbols: parsed_values.append('+') if token == '(': # if we need to distribute something over the tokens inside the parens # we will know it because... it will end in a : # that part of the list will be 'px', '+', '(' distributing = (len(parsed_values) > 2 and parsed_values[-2].endswith(':') and parsed_values[-1] == '+') if distributing: # we don't need the + if we are distributing parsed_values = parsed_values[:-1] offset = self.__tl_len - len(token_list) inner_value = inner_parse(token_list[i+1:], True) inner_value = ' '.join(inner_value) if distributing: if len(self.tokenize(inner_value)) == 1: parsed_values[-1] = parsed_values[-1] + inner_value elif "'" in inner_value: parsed_values[-1] = parsed_values[-1] + '"' + inner_value + '"' elif '"' in inner_value: parsed_values[-1] = parsed_values[-1] + "'" + inner_value + "'" else: parsed_values[-1] = parsed_values[-1] + '"' + inner_value + '"' else: parsed_values.append(inner_value) self.__tl_idx += 1 i = self.__tl_idx - offset elif token == ')': if parsed_values[-1] in op_symbols: parsed_values = parsed_values[:-1] if len(parsed_values) > 1 and parsed_values[0] == '+' and parsed_values[1] in op_symbols: parsed_values = parsed_values[1:] return parsed_values elif token in op_symbols: if len(parsed_values) > 0: parsed_values[-1] = token else: parsed_values = [token] else: if variable_substitution_dict != None and token in variable_substitution_dict: token = variable_substitution_dict[token] parsed_values.append(token) i += 1 self.__tl_idx += 1 # If we have an extra start symbol, remove the default one if parsed_values[1] in op_symbols: parsed_values = parsed_values[1:] return parsed_values return inner_parse(token_list, False) class SpiresToInvenioSyntaxConverter: """Converts queries defined with SPIRES search syntax into queries that use Invenio search syntax. 
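
    A rough usage sketch (illustrative; assumes CFG_WEBSEARCH_SPIRES_SYNTAX is
    enabled, and the exact expansion of authors and keywords is what the unit
    test module pins down):

        converter = SpiresToInvenioSyntaxConverter()
        if converter.is_applicable('find a ellis and t shapes'):
            query = converter.convert_query('find a ellis and t shapes')
            # query is now roughly equivalent to: author:ellis and title:shapes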
""" # Constants defining fields _DATE_ADDED_FIELD = 'datecreated:' _DATE_UPDATED_FIELD = 'datemodified:' _DATE_FIELD = 'year:' _A_TAG = 'author:' _EA_TAG = 'exactauthor:' - # Dictionary containing the matches between SPIRES keywords # and their corresponding Invenio keywords or fields # SPIRES keyword : Invenio keyword or field _SPIRES_TO_INVENIO_KEYWORDS_MATCHINGS = { # address 'address' : 'address:', # affiliation 'affiliation' : 'affiliation:', 'affil' : 'affiliation:', 'aff' : 'affiliation:', 'af' : 'affiliation:', 'institution' : 'affiliation:', 'inst' : 'affiliation:', # any field 'any' : 'anyfield:', # author count 'ac' : 'authorcount:', # bulletin 'bb' : 'reportnumber:', 'bbn' : 'reportnumber:', 'bull' : 'reportnumber:', 'bulletin-bd' : 'reportnumber:', 'bulletin-bd-no' : 'reportnumber:', 'eprint' : 'reportnumber:', # citation / reference 'c' : 'reference:', 'citation' : 'reference:', 'cited' : 'reference:', 'jour-vol-page' : 'reference:', 'jvp' : 'reference:', # collaboration 'collaboration' : 'collaboration:', 'collab-name' : 'collaboration:', 'cn' : 'collaboration:', # conference number 'conf-number' : '111__g:', 'cnum' : '773__w:', # country 'cc' : '044__a:', 'country' : '044__a:', # date 'date': _DATE_FIELD, 'd': _DATE_FIELD, # date added 'date-added': _DATE_ADDED_FIELD, 'dadd': _DATE_ADDED_FIELD, 'da': _DATE_ADDED_FIELD, # date updated 'date-updated': _DATE_UPDATED_FIELD, 'dupd': _DATE_UPDATED_FIELD, 'du': _DATE_UPDATED_FIELD, # first author 'fa' : 'firstauthor:', 'first-author' : 'firstauthor:', # author 'a' : 'author:', 'au' : 'author:', 'author' : 'author:', 'name' : 'author:', # exact author # this is not a real keyword match. It is pseudo keyword that # will be replaced later with author search 'ea' : 'exactauthor:', 'exact-author' : 'exactauthor:', # experiment 'exp' : 'experiment:', 'experiment' : 'experiment:', 'expno' : 'experiment:', 'sd' : 'experiment:', 'se' : 'experiment:', # journal 'journal' : 'journal:', 'j' : 'journal:', 'published_in' : 'journal:', 'spicite' : 'journal:', 'vol' : 'journal:', # journal page 'journal-page' : '773__c:', 'jp' : '773__c:', # journal year 'journal-year' : '773__y:', 'jy' : '773__y:', # key 'key' : '970__a:', 'irn' : '970__a:', 'record' : '970__a:', 'document' : '970__a:', 'documents' : '970__a:', # keywords 'k' : 'keyword:', 'keywords' : 'keyword:', 'kw' : 'keyword:', # note 'note' : '500__a:', # old title 'old-title' : '246__a:', 'old-t' : '246__a:', 'ex-ti' : '246__a:', 'et' : '246__a:', #postal code 'postalcode' : 'postalcode:', 'zip' : 'postalcode:', 'cc' : 'postalcode:', # ppf subject 'ppf-subject' : '650__a:', 'status' : '650__a:', # recid 'recid' : 'recid:', # report number 'r' : 'reportnumber:', 'rn' : 'reportnumber:', 'rept' : 'reportnumber:', 'report' : 'reportnumber:', 'report-num' : 'reportnumber:', # title 't' : 'title:', 'ti' : 'title:', 'title' : 'title:', 'with-language' : 'title:', # fulltext 'fulltext' : 'fulltext:', 'ft' : 'fulltext:', # topic 'topic' : '695__a:', 'tp' : '695__a:', 'hep-topic' : '695__a:', 'desy-keyword' : '695__a:', 'dk' : '695__a:', # topcite 'topcit' : 'cited:', 'topcite' : 'cited:', # captions 'caption' : 'caption:', # category 'arx' : '037__c:', 'category' : '037__c:', # primarch 'parx' : '037__c:', 'primarch' : '037__c:', # texkey 'texkey' : '035__z:', # type code 'tc' : 'collection:', 'ty' : 'collection:', 'type' : 'collection:', 'type-code' : 'collection:', 'scl': 'collection:', 'ps': 'collection:', # field code 'f' : 'subject:', 'fc' : 'subject:', 'field' : 'subject:', 'field-code' : 
'subject:',
        'subject' : 'subject:',
        # coden
        'bc' : 'journal:',
        'browse-only-indx' : 'journal:',
        'coden' : 'journal:',
        'journal-coden' : 'journal:',
        # jobs specific codes
        'job' : 'title:',
        'position' : 'title:',
        'region' : 'region:',
        'continent' : 'region:',
        'deadline' : '046__a:',
        'rank' : 'rank:',
        # replace all the keywords without match with empty string
        # this will remove the noise from the unknown keywords in the search
        # and will search in all fields for the words following the keywords
        # energy
        'e' : '',
        'energy' : '',
        'energyrange-code' : '',
        # exact experiment number
        'ee' : '',
        'exact-exp' : '',
        'exact-expno' : '',
        # hidden note
        'hidden-note' : '',
        'hn' : '',
        # ppf
        'ppf' : '',
        'ppflist' : '',
        # slac topics
        'ppfa' : '',
        'slac-topics' : '',
        'special-topics' : '',
        'stp' : '',
        # test index
        'test' : '',
        'testindex' : '',
        }

    _SECOND_ORDER_KEYWORD_MATCHINGS = {
        'rawref' : 'rawref:',
        'refersto' : 'refersto:',
        'refs': 'refersto:',
        'citedby' : 'citedby:'
        }

    _INVENIO_KEYWORDS_FOR_SPIRES_PHRASE_SEARCHES = [
        'affiliation:',
        #'cited:',  # topcite is technically a phrase index - this isn't necessary
        '773__y:',  # journal-year
        '773__c:',  # journal-page
        '773__w:',  # cnum
        '044__a:',  # country code
        'subject:',  # field code
        'collection:',  # type code
        '035__z:',  # texkey
        # also exact expno, corp-auth, url, abstract, doi, mycite, citing
        # but we have no invenio equivalents for these ATM
        ]

    def __init__(self):
        """Initialize the state of the converter"""
        self._months = {}
        self._month_name_to_month_number = {}
        self._init_months()
        self._compile_regular_expressions()

    def _compile_regular_expressions(self):
        """Compiles some of the regular expressions that are used in the class
        for higher performance."""

        # regular expression that matches the contents in single and double quotes
        # taking in mind if they are escaped.
        self._re_quotes_match = re.compile(r'(?![\\])(".*?[^\\]")' + r"|(?![\\])('.*?[^\\]')")

        # match cases where a keyword distributes across a conjunction
        self._re_distribute_keywords = re.compile(r'''(?ix)  # verbose, ignorecase on
                \b(?P<keyword>\S*:)     # a keyword is anything that's not whitespace with a colon
                (?P<content>[^:]+?)\s*  # content is the part that comes after the keyword; it should NOT
                                        # have colons in it!  that implies that we might be distributing
                                        # a keyword OVER another keyword. see ticket #701
                (?P<combination>\ and\ not\ |\ and\ |\ or\ |\ not\ )\s*
                (?P<last_content>[^:]*?)  # oh look, content without a keyword!
                (?=\ and\ |\ or\ |\ not\ |$)''')

        # massaging SPIRES quirks
        self._re_pattern_IRN_search = re.compile(r'970__a:(?P<irn>\d+)')
        self._re_topcite_match = re.compile(r'(?P<x>cited:\d+)\+')

        # regular expression that matches author patterns
        # and author patterns with second-order-ops on top
        # does not match names with " or ' around them, since
        # those should not be touched
        self._re_author_match = re.compile(r'''(?ix)  # verbose, ignorecase
                \b((?P<secondorderop>[^\s]+:)?)  # do we have a second-order-op on top?
                ((?P<first>first)?)author:(?P<name>
                            [^\'\"]   # first character not a quotemark
                            [^()]*?   # some stuff that isn't parentheses (that is dealt with in pp)
                            [^\'\"])  # last character not a quotemark
                (?=\ and\ not\ |\ and\ |\ or\ |\ not\ |$)''')

        # regular expression that matches exact author patterns
        # the group defined in this regular expression is used in method
        # _convert_spires_exact_author_search_to_invenio_author_search(...)
        # in case of changes correct also the code in this method
        self._re_exact_author_match = re.compile(r'\b((?P<secondorderop>[^\s]+:)?)exactauthor:(?P<author_name>[^\'\"].*?[^\'\"]\b)(?= and not | and | or | not |$)', re.IGNORECASE)

        # match a second-order operator with no operator following it
        self._re_second_order_op_no_index_match = re.compile(r'''(?ix)  # ignorecase, verbose
                (^|\b|:)(?P<second_order_op>(refersto|citedby):)
                (?P<search_terms>[^\"\'][^:]+?)  # anything without an index should be absorbed here
                \s*
                (?P<conjunction_or_next_keyword>(\ and\ |\ not\ |\ or\ |\ \w+:\w+|$))
                ''')

        # match search term, its content (words that are searched) and
        # the operator preceding the term.
        self._re_search_term_pattern_match = re.compile(r'\b(?P<combine_operator>find|and|or|not)\s+(?P<search_term>\S+:)(?P<search_content>.+?)(?= and not | and | or | not |$)', re.IGNORECASE)

        # match journal searches
        self._re_search_term_is_journal = re.compile(r'''(?ix)  # verbose, ignorecase
                \b(?P<leading>(find|and|or|not)\s+journal:)  # first combining operator and index
                (?P<search_content>.+?)                      # what we are searching
                (?=\ and\ not\ |\ and\ |\ or\ |\ not\ |$)''')

        # regular expression matching date after pattern
        self._re_date_after_match = re.compile(r'\b(?P<searchop>d|date|dupd|dadd|da|date-added|du|date-updated)\b\s*(after|>)\s*(?P<search_content>.+?)(?= and not | and | or | not |$)', re.IGNORECASE)

        # regular expression matching date before pattern
        self._re_date_before_match = re.compile(r'\b(?P<searchop>d|date|dupd|dadd|da|date-added|du|date-updated)\b\s*(before|<)\s*(?P<search_content>.+?)(?= and not | and | or | not |$)', re.IGNORECASE)

        # match date searches which have been keyword-substituted
        self._re_keysubbed_date_expr = re.compile(r'\b(?P<term>(' + self._DATE_ADDED_FIELD + ')|(' + self._DATE_UPDATED_FIELD + ')|(' + self._DATE_FIELD + '))(?P<content>.+?)(?= and not | and | or | not |$)', re.IGNORECASE)

        # for finding (and changing) a variety of different SPIRES search keywords
        self._re_spires_find_keyword = re.compile('^(f|fin|find)\s+', re.IGNORECASE)

        # for finding boolean expressions
        self._re_boolean_expression = re.compile(r' and | or | not | and not ')

        # patterns for subbing out spaces within quotes temporarily
        self._re_pattern_single_quotes = re.compile("'(.*?)'")
        self._re_pattern_double_quotes = re.compile("\"(.*?)\"")
        self._re_pattern_regexp_quotes = re.compile("\/(.*?)\/")
        self._re_pattern_space = re.compile("__SPACE__")
        self._re_pattern_equals = re.compile("__EQUALS__")

+        # for date math:
+        self._re_datemath = re.compile(r'(?P<datestamp>.+)\s+(?P<operator>[-+])\s+(?P<units>\d+)')
+
+
    def is_applicable(self, query):
        """Is this converter applicable to this query?

        Return true if query begins with find, fin, or f, or if it contains
        a SPIRES-specific keyword (a, t, etc.), or if it contains the invenio
        author: field search.
        """
        if not CFG_WEBSEARCH_SPIRES_SYNTAX:
            #SPIRES syntax is switched off
            return False
        query = query.lower()
        if self._re_spires_find_keyword.match(query):
            #leading 'find' is present and SPIRES syntax is switched on
            return True
        if CFG_WEBSEARCH_SPIRES_SYNTAX > 1:
            for word in query.split(' '):
                if self._SPIRES_TO_INVENIO_KEYWORDS_MATCHINGS.has_key(word):
                    return True
        return False

    def convert_query(self, query):
        """Convert SPIRES syntax queries to Invenio syntax.

        Do nothing to queries not in SPIRES syntax."""

        # SPIRES syntax allows searches with 'find' or 'fin'.
        if self.is_applicable(query):
            query = re.sub(self._re_spires_find_keyword, 'find ', query)
            if not query.startswith('find'):
                query = 'find ' + query

            # a holdover from SPIRES syntax is e.g.
date = 2000 rather than just date 2000 query = self._remove_extraneous_equals_signs(query) # these calls are before keywords replacement because when keywords # are replaced, date keyword is replaced by specific field search # and the DATE keyword is not match in DATE BEFORE or DATE AFTER query = self._convert_spires_date_before_to_invenio_span_query(query) query = self._convert_spires_date_after_to_invenio_span_query(query) # call to _replace_spires_keywords_with_invenio_keywords should be at the # beginning because the next methods use the result of the replacement query = self._standardize_already_invenio_keywords(query) query = self._replace_spires_keywords_with_invenio_keywords(query) query = self._normalise_journal_page_format(query) query = self._distribute_keywords_across_combinations(query) query = self._distribute_and_quote_second_order_ops(query) query = self._convert_dates(query) query = self._convert_irns_to_spires_irns(query) query = self._convert_topcite_to_cited(query) query = self._convert_spires_author_search_to_invenio_author_search(query) query = self._convert_spires_exact_author_search_to_invenio_author_search(query) query = self._convert_spires_truncation_to_invenio_truncation(query) query = self._expand_search_patterns(query) # remove FIND in the beginning of the query as it is not necessary in Invenio query = query[4:] query = query.strip() return query def _init_months(self): """Defines a dictionary matching the name of the month with its corresponding number""" # this dictionary is used when generating match patterns for months self._months = {'jan':'01', 'january':'01', 'feb':'02', 'february':'02', 'mar':'03', 'march':'03', 'apr':'04', 'april':'04', 'may':'05', 'may':'05', 'jun':'06', 'june':'06', 'jul':'07', 'july':'07', 'aug':'08', 'august':'08', 'sep':'09', 'september':'09', 'oct':'10', 'october':'10', 'nov':'11', 'november':'11', 'dec':'12', 'december':'12'} # this dictionary is used to transform name of the month # to a number used in the date format. By this reason it # contains also the numbers itself to simplify the conversion self._month_name_to_month_number = {'1':'01', '01':'01', '2':'02', '02':'02', '3':'03', '03':'03', '4':'04', '04':'04', '5':'05', '05':'05', '6':'06', '06':'06', '7':'07', '07':'07', '8':'08', '08':'08', '9':'09', '09':'09', '10':'10', '11':'11', '12':'12',} # combine it with months in order to cover all the cases self._month_name_to_month_number.update(self._months) def _get_month_names_match(self): """Retruns part of a patter that matches month in a date""" months_match = '' for month_name in self._months.keys(): months_match = months_match + month_name + '|' months_match = r'\b(' + months_match[0:-1] + r')\b' return months_match def _convert_dates(self, query): """Tries to find dates in query and make them look like ISO-8601.""" + def parse_relative_unit(date_str): + units = 0 + datemath = self._re_datemath.match(date_str) + if datemath: + date_str = datemath.group('datestamp') + units = int(datemath.group('operator') + datemath.group('units')) + return date_str, units + + def guess_best_year(d): + if d.year > datetime.today().year + 10: + return d - du_delta(years=100) + else: + return d + + def parse_date_unit(date_str): + begin = date_str + end = None + + # First split, relative time directive + # e.g. 
"2012-01-01 - 3" to ("2012-01-01", -3) + date_str, relative_units = parse_relative_unit(date_str) + + try: + d = datetime.strptime(date_str, '%Y-%m-%d') + d += du_delta(days=relative_units) + return datetime.strftime(d, '%Y-%m-%d'), end + except ValueError: + pass + + try: + d = datetime.strptime(date_str, '%y-%m-%d') + d += du_delta(days=relative_units) + d = guess_best_year(d) + return datetime.strftime(d, '%Y-%m-%d'), end + except ValueError: + pass + + try: + d = datetime.strptime(date_str, '%Y-%m') + d += du_delta(months=relative_units) + return datetime.strftime(d, '%Y-%m'), end + except ValueError: + pass + + try: + d = datetime.strptime(date_str, '%Y') + d += du_delta(years=relative_units) + return datetime.strftime(d, '%Y'), end + except ValueError: + pass + + try: + d = datetime.strptime(date_str, '%y') + d += du_delta(days=relative_units) + d = guess_best_year(d) + return datetime.strftime(d, '%Y'), end + except ValueError: + pass + + try: + d = datetime.strptime(date_str, '%b %y') + d = guess_best_year(d) + return datetime.strftime(d, '%Y-%m'), end + except ValueError: + pass + + if 'this week' in date_str: + # Past monday to today + # This week is iffy, not sure if we should + # start with sunday or monday + begin = datetime.today() + begin += du_delta(weekday=relativedelta.SU(-1)) + end = datetime.today() + begin = datetime.strftime(begin, '%Y-%m-%d') + end = datetime.strftime(end, '%Y-%m-%d') + elif 'last week' in date_str: + # Past monday to today + # Same problem as last week + begin = datetime.today() + begin += du_delta(weekday=relativedelta.SU(-2)) + end = datetime.today() + end += du_delta(weekday=relativedelta.SA(-1)) + begin = datetime.strftime(begin, '%Y-%m-%d') + end = datetime.strftime(end, '%Y-%m-%d') + elif 'this month' in date_str: + d = datetime.today() + begin = datetime.strftime(d, '%Y-%m') + elif 'last month' in date_str: + d = datetime.today() - du_delta(months=1) + begin = datetime.strftime(d, '%Y-%m') + elif 'yesterday' in date_str: + d = datetime.today() - du_delta(days=1) + begin = datetime.strftime(d, '%Y-%m-%d') + elif 'today' in date_str: + start = datetime.today() + start += du_delta(days=relative_units) + begin = datetime.strftime(start, '%Y-%m-%d') + elif date_str.strip() == '0': + begin = '0' + else: + default = datetime(datetime.today().year, 1, 1) + try: + d = du_parser.parse(date_str, default=default) + except ValueError: + begin = date_str + else: + begin = datetime.strftime(d, '%Y-%m-%d') + + return begin, end + def mangle_with_dateutils(query): - DEFAULT = datetime(datetime.today().year, 1, 1) result = '' position = 0 for match in self._re_keysubbed_date_expr.finditer(query): result += query[position : match.start()] + datestamp = match.group('content') + if '->' in datestamp: + begin_unit, end_unit = datestamp.split('->', 1) + begin, dummy = parse_date_unit(begin_unit) + end, dummy = parse_date_unit(end_unit) + else: + begin, end = parse_date_unit(datestamp) + + if end: + daterange = '%s->%s' % (begin, end) + else: + daterange = begin - isodates = [] - dates = match.group('content').split('->') # Warning: generalizing but should only ever be 2 items - for datestamp in dates: - if datestamp != None: - if re.match('[0-9]{1,4}$', datestamp): - isodates.append(datestamp) - else: - units = 0 - datestamp = re.sub('yesterday', datetime.strftime(datetime.today() - +du_delta(days=-1), '%Y-%m-%d'), - datestamp) - datestamp = re.sub('today', datetime.strftime(datetime.today(), '%Y-%m-%d'), datestamp) - datestamp = re.sub('this week', 
datetime.strftime(datetime.today() - +du_delta(days=-(datetime.today().isoweekday()%7)), '%Y-%m-%d'), - datestamp) - datestamp = re.sub('last week', datetime.strftime(datetime.today() - +du_delta(days=-((datetime.today().isoweekday()%7)+7)), '%Y-%m-%d'), - datestamp) - datestamp = re.sub('this month', datetime.strftime(datetime.today(), '%Y-%m'), - datestamp) - datestamp = re.sub('last month', datetime.strftime(datetime.today() - +du_delta(months=-1), '%Y-%m'), - datestamp) - datemath = re.match(r'(?P.+)\s+(?P[-+])\s+(?P\d+)', datestamp) - if datemath: - datestamp = datemath.group('datestamp') - units += int(datemath.group('operator') + datemath.group('units')) - try: - dtobj = du_parser.parse(datestamp, default=DEFAULT) - dtobj = dtobj + du_delta(days=units) - if dtobj.day == 1: - isodates.append("%d-%02d" % (dtobj.year, dtobj.month)) - else: - isodates.append("%d-%02d-%02d" % (dtobj.year, dtobj.month, dtobj.day)) - except ValueError: - isodates.append(datestamp) - - daterange = '->'.join(isodates) result += match.group('term') + daterange position = match.end() result += query[position : ] return result if GOT_DATEUTIL: query = mangle_with_dateutils(query) # else do nothing with the dates return query def _convert_irns_to_spires_irns(self, query): """Prefix IRN numbers with SPIRES- so they match the INSPIRE format.""" def create_replacement_pattern(match): """method used for replacement with regular expression""" return '970__a:SPIRES-' + match.group('irn') query = self._re_pattern_IRN_search.sub(create_replacement_pattern, query) return query def _convert_topcite_to_cited(self, query): """Replace SPIRES topcite x+ with cited:x->999999999""" def create_replacement_pattern(match): """method used for replacement with regular expression""" return match.group('x') + '->999999999' query = self._re_topcite_match.sub(create_replacement_pattern, query) return query def _convert_spires_date_after_to_invenio_span_query(self, query): """Converts date after SPIRES search term into invenio span query""" def create_replacement_pattern(match): """method used for replacement with regular expression""" return match.group('searchop') + ' ' + match.group('search_content') + '->9999' query = self._re_date_after_match.sub(create_replacement_pattern, query) return query def _convert_spires_date_before_to_invenio_span_query(self, query): """Converts date before SPIRES search term into invenio span query""" # method used for replacement with regular expression def create_replacement_pattern(match): return match.group('searchop') + ' ' + '0->' + match.group('search_content') query = self._re_date_before_match.sub(create_replacement_pattern, query) return query def _expand_search_patterns(self, query): """Expands search queries. If a search term is followed by several words e.g. author:ellis or title:THESE THREE WORDS it is expanded to author:ellis or (title:THESE and title:THREE...) All keywords are thus expanded. XXX: this may lead to surprising results for any later parsing stages if we're not careful. 
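
        A small illustrative sketch (approximate, not verbatim output):

            title:quark mass          ->  (title:quark and title:mass)
            affiliation:DESY Hamburg  ->  affiliation:"DESY Hamburg"

        the second case stays a single phrase because affiliation: is listed
        in _INVENIO_KEYWORDS_FOR_SPIRES_PHRASE_SEARCHES.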
""" def create_replacements(term, content): result = '' content = content.strip() # replace spaces within quotes by __SPACE__ temporarily: content = self._re_pattern_single_quotes.sub(lambda x: "'"+string.replace(x.group(1), ' ', '__SPACE__')+"'", content) content = self._re_pattern_double_quotes.sub(lambda x: "\""+string.replace(x.group(1), ' ', '__SPACE__')+"\"", content) content = self._re_pattern_regexp_quotes.sub(lambda x: "/"+string.replace(x.group(1), ' ', '__SPACE__')+"/", content) if term in self._INVENIO_KEYWORDS_FOR_SPIRES_PHRASE_SEARCHES \ and not self._re_boolean_expression.search(content) and ' ' in content: # the case of things which should be searched as phrases result = term + '"' + content + '"' else: words = content.split() if len(words) == 0: # this should almost never happen, req user to say 'find a junk:' result = term elif len(words) == 1: # this is more common but still occasional result = term + words[0] else: # general case result = '(' + term + words[0] for word in words[1:]: result += ' and ' + term + word result += ')' # replace back __SPACE__ by spaces: result = self._re_pattern_space.sub(" ", result) return result.strip() result = '' current_position = 0 for match in self._re_search_term_pattern_match.finditer(query): result += query[current_position : match.start()] result += ' ' + match.group('combine_operator') + ' ' result += create_replacements(match.group('search_term'), match.group('search_content')) current_position = match.end() result += query[current_position : len(query)] return result.strip() def _remove_extraneous_equals_signs(self, query): """In SPIRES, both date = 2000 and date 2000 are acceptable. Get rid of the =""" query = self._re_pattern_single_quotes.sub(lambda x: "'"+string.replace(x.group(1), '=', '__EQUALS__')+"'", query) query = self._re_pattern_double_quotes.sub(lambda x: "\""+string.replace(x.group(1), '=', '__EQUALS__')+'\"', query) query = self._re_pattern_regexp_quotes.sub(lambda x: "/"+string.replace(x.group(1), '=', '__EQUALS__')+"/", query) query = query.replace('=', '') query = self._re_pattern_equals.sub("=", query) return query def _convert_spires_truncation_to_invenio_truncation(self, query): """Replace SPIRES truncation symbol # with invenio trancation symbol *""" return query.replace('#', '*') def _convert_spires_exact_author_search_to_invenio_author_search(self, query): """Converts SPIRES search patterns for exact author into search pattern for invenio""" # method used for replacement with regular expression def create_replacement_pattern(match): # the regular expression where this group name is defined is in # the method _compile_regular_expressions() return self._EA_TAG + '"' + match.group('author_name') + '"' query = self._re_exact_author_match.sub(create_replacement_pattern, query) return query def _convert_spires_author_search_to_invenio_author_search(self, query): """Converts SPIRES search patterns for authors to search patterns in invenio that give similar results to the spires search. 
""" # result of the replacement result = '' current_position = 0 for match in self._re_author_match.finditer(query): result += query[current_position : match.start() ] if match.group('secondorderop'): result += match.group('secondorderop') scanned_name = NameScanner.scan(match.group('name')) author_atoms = self._create_author_search_pattern_from_fuzzy_name_dict(scanned_name) if match.group('first'): author_atoms = author_atoms.replace('author:', 'firstauthor:') if author_atoms.find(' ') == -1: result += author_atoms + ' ' else: result += '(' + author_atoms + ') ' current_position = match.end() result += query[current_position : len(query)] return result def _create_author_search_pattern_from_fuzzy_name_dict(self, fuzzy_name): """Creates an invenio search pattern for an author from a fuzzy name dict""" author_name = '' author_middle_name = '' author_surname = '' full_search = '' if len(fuzzy_name['nonlastnames']) > 0: author_name = fuzzy_name['nonlastnames'][0] if len(fuzzy_name['nonlastnames']) == 2: author_middle_name = fuzzy_name['nonlastnames'][1] if len(fuzzy_name['nonlastnames']) > 2: author_middle_name = ' '.join(fuzzy_name['nonlastnames'][1:]) if fuzzy_name['raw']: full_search = fuzzy_name['raw'] author_surname = ' '.join(fuzzy_name['lastnames']) NAME_IS_INITIAL = (len(author_name) == 1) NAME_IS_NOT_INITIAL = not NAME_IS_INITIAL # we expect to have at least surname if author_surname == '' or author_surname == None: return '' # ellis ---> "author:ellis" #if author_name == '' or author_name == None: if not author_name: return self._A_TAG + author_surname # ellis, j ---> "ellis, j*" if NAME_IS_INITIAL and not author_middle_name: return self._A_TAG + '"' + author_surname + ', ' + author_name + '*"' # if there is middle name we expect to have also name and surname # ellis, j. r. ---> ellis, j* r* # j r ellis ---> ellis, j* r* # ellis, john r. ---> ellis, j* r* or ellis, j. r. or ellis, jo. r. # ellis, john r. ---> author:ellis, j* r* or exactauthor:ellis, j r or exactauthor:ellis jo r if author_middle_name: search_pattern = self._A_TAG + '"' + author_surname + ', ' + author_name + '*' + ' ' + author_middle_name.replace(" ","* ") + '*"' if NAME_IS_NOT_INITIAL: for i in range(1, len(author_name)): search_pattern += ' or ' + self._EA_TAG + "\"%s, %s %s\"" % (author_surname, author_name[0:i], author_middle_name) return search_pattern # ellis, jacqueline ---> "ellis, jacqueline" or "ellis, j.*" or "ellis, j" or "ellis, ja.*" or "ellis, ja" or "ellis, jacqueline *, ellis, j *" # in case we don't use SPIRES data, the ending dot is ommited. 
search_pattern = self._A_TAG + '"' + author_surname + ', ' + author_name + '*"' search_pattern += " or " + self._EA_TAG + "\"%s, %s *\"" % (author_surname, author_name[0]) if NAME_IS_NOT_INITIAL: for i in range(1,len(author_name)): search_pattern += ' or ' + self._EA_TAG + "\"%s, %s\"" % (author_surname, author_name[0:i]) search_pattern += ' or %s"%s, *"' % (self._A_TAG, full_search) return search_pattern def _normalise_journal_page_format(self, query): """Phys.Lett, 0903, 024 -> Phys.Lett,0903,024""" def _is_triple(search): return (len(re.findall('\s+', search)) + len(re.findall(':', search))) == 2 def _normalise_spaces_and_colons_to_commas_in_triple(search): if not _is_triple(search): return search search = re.sub(',\s+', ',', search) search = re.sub('\s+', ',', search) search = re.sub(':', ',', search) return search result = "" current_position = 0 for match in self._re_search_term_is_journal.finditer(query): result += query[current_position : match.start()] result += match.group('leading') search = match.group('search_content') search = _normalise_spaces_and_colons_to_commas_in_triple(search) result += search current_position = match.end() result += query[current_position : ] return result def _standardize_already_invenio_keywords(self, query): """Replaces invenio keywords kw with "and kw" in order to parse them correctly further down the line.""" unique_invenio_keywords = set(self._SPIRES_TO_INVENIO_KEYWORDS_MATCHINGS.values()) |\ set(self._SECOND_ORDER_KEYWORD_MATCHINGS.values()) unique_invenio_keywords.remove('') # for the ones that don't have invenio equivalents for invenio_keyword in unique_invenio_keywords: query = re.sub("(?(^find|\band|\bor|\bnot|\brefersto|\bcitedby|^)\b[:\s\(]*)' + \ old_keyword + r'(?P[\s\(]+|$)' regular_expression = re.compile(regex_string, re.IGNORECASE) result = regular_expression.sub(r'\g' + new_keyword + r'\g', query) result = re.sub(':\s+', ':', result) return result def _replace_second_order_keyword(self, query, old_keyword, new_keyword): """Replaces old second-order keyword in the query with a new keyword""" regular_expression =\ re.compile(r'''(?ix) # verbose, ignorecase (?P (^find|\band|\bor|\bnot|\brefersto|\bcitedby|^)\b # operator preceding our operator [:\s\(]* # trailing colon, spaces, parens, etc. 
for that operator ) %s # the keyword we're searching for (?P \s*[a-z]+:| # either an operator (like author:) [\s\(]+| # or a paren opening $ # or the end of the string )''' % old_keyword) result = regular_expression.sub(r'\g' + new_keyword + r'\g', query) result = re.sub(':\s+', ':', result) return result def _distribute_keywords_across_combinations(self, query): """author:ellis and james -> author:ellis and author:james""" # method used for replacement with regular expression def create_replacement_pattern(match): return match.group('keyword') + match.group('content') + \ match.group('combination') + match.group('keyword') + \ match.group('last_content') still_matches = True while still_matches: query = self._re_distribute_keywords.sub(create_replacement_pattern, query) still_matches = self._re_distribute_keywords.search(query) query = re.sub(r'\s+', ' ', query) return query def _distribute_and_quote_second_order_ops(self, query): """refersto:s parke -> refersto:\"s parke\"""" def create_replacement_pattern(match): return match.group('second_order_op') + '"' +\ match.group('search_terms') + '"' +\ match.group('conjunction_or_next_keyword') for match in self._re_second_order_op_no_index_match.finditer(query): query = self._re_second_order_op_no_index_match.sub(create_replacement_pattern, query) query = re.sub(r'\s+', ' ', query) return query diff --git a/modules/websearch/lib/search_engine_query_parser_unit_tests.py b/modules/websearch/lib/search_engine_query_parser_unit_tests.py index 83198bfc7..b4dda4404 100644 --- a/modules/websearch/lib/search_engine_query_parser_unit_tests.py +++ b/modules/websearch/lib/search_engine_query_parser_unit_tests.py @@ -1,1019 +1,1070 @@ # -*- coding: utf-8 -*- ## ## This file is part of Invenio. -## Copyright (C) 2008, 2010, 2011, 2012 CERN. +## Copyright (C) 2008, 2010, 2011, 2012, 2013 CERN. ## ## Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. 
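
## The suites below exercise the two-stage query pipeline, roughly:
##
##     converter = search_engine_query_parser.SpiresToInvenioSyntaxConverter()
##     parser = search_engine_query_parser.SearchQueryParenthesisedParser()
##     parser.parse_query(converter.convert_query('find a ellis and t shapes'))
##
## i.e. SPIRES-syntax queries are first translated to Invenio syntax and then
## parsed into the flat, annotated list that search_engine consumes (see e.g.
## TestParserUtilityFunctions.test_stisc below). Illustrative only; the exact
## expected outputs are what the individual test cases assert.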
"""Unit tests for the search engine query parsers.""" import unittest import datetime from invenio import search_engine_query_parser from invenio.testutils import make_test_suite, run_test_suite from invenio.search_engine import create_basic_search_units, perform_request_search from invenio.config import CFG_WEBSEARCH_SPIRES_SYNTAX if search_engine_query_parser.GOT_DATEUTIL: import dateutil + from dateutil.relativedelta import relativedelta as du_delta DATEUTIL_AVAILABLE = True else: DATEUTIL_AVAILABLE = False class TestParserUtilityFunctions(unittest.TestCase): """Test utility functions for the parsing components""" def setUp(self): self.parser = search_engine_query_parser.SearchQueryParenthesisedParser() self.converter = search_engine_query_parser.SpiresToInvenioSyntaxConverter() def test_ndb_simple(self): """SQPP.test_nesting_depth_and_balance: ['p0']""" self.assertEqual((0, True, 0), self.parser.nesting_depth_and_balance(['p0'])) def test_ndb_simple_useful(self): """SQPP.test_nesting_depth_and_balance: ['(', 'p0', ')']""" self.assertEqual((1, True, 1), self.parser.nesting_depth_and_balance(['(', 'p0', ')'])) def test_ndb_slightly_complicated(self): """SQPP.test_nesting_depth_and_balance: ['(', 'p0', ')', '|', '(', 'p2', '+', 'p3', ')']""" self.assertEqual((1, True, 2), self.parser.nesting_depth_and_balance(['(', 'p0', ')', '|', '(', 'p2', '+', 'p3', ')'])) def test_ndb_sorta_hairy(self): """SQPP.test_nesting_depth_and_balance: ['(', '(', ')', ')', '(', '(', '(', ')', ')', ')']""" self.assertEqual((3, True, 2), self.parser.nesting_depth_and_balance(['(', '(', ')', ')', '(', '(', '(', ')', ')', ')'])) def test_ndb_broken_rhs(self): """SQPP.test_nesting_depth_and_balance: ['(', '(', ')', ')', '(', '(', '(', ')', ')', ]""" self.assertEqual((3, False, 2), self.parser.nesting_depth_and_balance(['(', '(', ')', ')', '(', '(', '(', ')', ')', ])) def test_ndb_broken_lhs(self): """SQPP.test_nesting_depth_and_balance: ['(', ')', ')', '(', '(', '(', ')', ')', ')']""" self.assertEqual((3, False, 2), self.parser.nesting_depth_and_balance(['(', ')', ')', '(', '(', '(', ')', ')', ])) def test_stisc(self): """Test whole convert/parse stack: SQPP.parse_query(STISC.convert_query('find a richter, burton and t quark'))""" self.assertEqual(self.parser.parse_query(self.converter.convert_query('find a richter, burton and t quark')), ['+', 'author:"richter, burton*" | exactauthor:"richter, b *" | exactauthor:"richter, b" | exactauthor:"richter, bu" | exactauthor:"richter, bur" | exactauthor:"richter, burt" | exactauthor:"richter, burto" | author:"richter, burton, *"', '+', 'title:quark']) def test_stisc_not_vs_and_not1(self): """Parse stack parses "find a ellis, j and not a enqvist" == "find a ellis, j not a enqvist" """ self.assertEqual(self.parser.parse_query(self.converter.convert_query('find a ellis, j and not a enqvist')), self.parser.parse_query(self.converter.convert_query('find a ellis, j not a enqvist'))) def test_stisc_not_vs_and_not2(self): """Parse stack parses "find a mangano, m and not a ellis, j" == "find a mangano, m not a ellis, j" """ self.assertEqual(self.parser.parse_query(self.converter.convert_query('find a mangano, m and not a ellis, j')), self.parser.parse_query(self.converter.convert_query('find a mangano, m not a ellis, j'))) class TestSearchQueryParenthesisedParser(unittest.TestCase): """Test parenthesis parsing.""" def setUp(self): self.parser = search_engine_query_parser.SearchQueryParenthesisedParser() def test_sqpp_atom(self): """SearchQueryParenthesisedParser - expr1""" 
self.assertEqual(self.parser.parse_query('expr1'), ['+', 'expr1']) def test_sqpp_parened_atom(self): """SearchQueryParenthesisedParser - (expr1)""" self.assertEqual(self.parser.parse_query('(expr1)'), ['+', 'expr1']) def test_sqpp_expr1_minus_expr2(self): """SearchQueryParenthesisedParser - expr1 - (expr2)""" self.assertEqual(self.parser.parse_query("expr1 - (expr2)"), ['+', 'expr1', '-', 'expr2']) def test_sqpp_plus_expr1_minus_paren_expr2(self): """SearchQueryParenthesisedParser - + expr1 - (expr2)""" self.assertEqual(self.parser.parse_query("+ expr1 - (expr2)"), ['+', 'expr1', '-', 'expr2']) def test_sqpp_expr1_paren_expr2(self): """SearchQueryParenthesisedParser - expr1 (expr2)""" self.assertEqual(self.parser.parse_query("expr1 (expr2)"), ['+', 'expr1', '+', 'expr2']) def test_sqpp_paren_expr1_minus_expr2(self): """SearchQueryParenthesisedParser - (expr1) - expr2""" self.assertEqual(self.parser.parse_query("(expr1) - expr2"), ['+', 'expr1', '-', 'expr2']) def test_sqpp_paren_expr1_minus_paren_expr2(self): """SearchQueryParenthesisedParser - (expr1)-(expr2)""" self.assertEqual(self.parser.parse_query("(expr1)-(expr2)"), ['+', 'expr1', '-', 'expr2']) def test_sqpp_minus_paren_expr1_minus_paren_expr2(self): """SearchQueryParenthesisedParser - -(expr1)-(expr2)""" self.assertEqual(self.parser.parse_query("-(expr1)-(expr2)"), ['-', 'expr1', '-', 'expr2']) def test_sqpp_paren_expr1_minus_expr2_and_paren_expr3(self): """SearchQueryParenthesisedParser - (expr1) - expr2 + (expr3)""" self.assertEqual(self.parser.parse_query('(expr1) - expr2 + (expr3)'), ['+', 'expr1', '-', 'expr2', '+', 'expr3']) def test_sqpp_paren_expr1_minus_expr2_and_paren_expr3_or_expr4(self): """SearchQueryParenthesisedParser - (expr1) - expr2 + (expr3) | expr4""" self.assertEqual(self.parser.parse_query('(expr1) - expr2 + (expr3) | expr4'), ['+', 'expr1', '-', 'expr2', '+', 'expr3', '|', 'expr4']) #['+', '+ expr1 | expr4', '+', '- expr2 | expr4', '+', '+ expr3 | expr4']) def test_sqpp_paren_expr1_minus_expr2_and_paren_expr3_or_expr4_or_quoted_expr5_and_expr6(self): """SearchQueryParenthesisedParser - (expr1) - expr2 + (expr3) | expr4 | \"expr5 + expr6\"""" self.assertEqual(self.parser.parse_query('(expr1) - expr2 + (expr3 | expr4) | "expr5 + expr6"'), ['+', 'expr1', '-', 'expr2', '+', 'expr3 | expr4', '|', '"expr5 + expr6"']), #['+', '+ expr1 | "expr5 + expr6"', '+', '- expr2 | "expr5 + expr6"', # '+', '+ expr3 | expr4 | "expr5 + expr6"']) def test_sqpp_quoted_expr1_and_paren_expr2_and_expr3(self): """SearchQueryParenthesisedParser - \"expr1\" (expr2) expr3""" self.assertEqual(self.parser.parse_query('"expr1" (expr2) expr3'), ['+', '"expr1"', '+', 'expr2', '+', 'expr3']) def test_sqpp_quoted_expr1_arrow_quoted_expr2(self): """SearchQueryParenthesisedParser = \"expr1\"->\"expr2\"""" self.assertEqual(self.parser.parse_query('"expr1"->"expr2"'), ['+', '"expr1"->"expr2"']) def test_sqpp_paren_expr1_expr2_paren_expr3_or_expr4(self): """SearchQueryParenthesisedParser - (expr1) expr2 (expr3) | expr4""" # test parsing of queries with missing operators. 
# in this case default operator + should be included on place of the missing one self.assertEqual(self.parser.parse_query('(expr1) expr2 (expr3) | expr4'), ['+', 'expr1', '+', 'expr2', '+', 'expr3', '|', 'expr4']) #['+', '+ expr1 | expr4', '+', '+ expr2 | expr4', '+', '+ expr3 | expr4']) def test_sqpp_nested_paren_success(self): """SearchQueryParenthesizedParser - Arbitrarily nested parentheses: ((expr1)) + (expr2 - expr3)""" self.assertEqual(self.parser.parse_query('((expr1)) + (expr2 - expr3)'), ['+', 'expr1', '+', 'expr2', '-', 'expr3']) #['+', 'expr1', '+', 'expr2', '-', 'expr3']) def test_sqpp_nested_paren_really_nested(self): """SearchQueryParenthesisedParser - Nested parentheses where order matters: expr1 - (expr2 - (expr3 | expr4))""" self.assertEqual(self.parser.parse_query('expr1 - (expr2 - (expr3 | expr4))'), ['+', 'expr1', '+', '- expr2 | expr3 | expr4']) def test_sqpp_paren_open_only_failure(self): """SearchQueryParenthesizedParser - Parentheses that only open should raise an exception""" self.failUnlessRaises(SyntaxError, self.parser.parse_query,"(expr") def test_sqpp_paren_close_only_failure(self): """SearchQueryParenthesizedParser - Parentheses that only close should raise an exception""" self.failUnlessRaises(SyntaxError, self.parser.parse_query,"expr)") def test_sqpp_paren_expr1_not_expr2_and_paren_expr3_or_expr4_WORDS(self): """SearchQueryParenthesisedParser - (expr1) not expr2 and (expr3) or expr4""" self.assertEqual(self.parser.parse_query('(expr1) not expr2 and (expr3) or expr4'), ['+', 'expr1', '-', 'expr2', '+', 'expr3', '|', 'expr4']) #['+', '+ expr1 | expr4', '+', '- expr2 | expr4', '+', '+ expr3 | expr4']) def test_sqpp_paren_expr1_not_expr2_or_quoted_string_not_expr3_or_expr4WORDS(self): """SearchQueryParenthesisedParser - (expr1) not expr2 | "expressions not in and quotes | (are) not - parsed " - (expr3) or expr4""" self.assertEqual(self.parser.parse_query('(expr1) not expr2 | "expressions not in and quotes | (are) not - parsed " - (expr3) or expr4'), ['+', 'expr1', '-', 'expr2', '|', '"expressions not in and quotes | (are) not - parsed "', '-', 'expr3', '|', 'expr4']) #['+', '+ "expressions not in and quotes | (are) not - parsed " | expr1 | expr4', # '+', '- expr3 | expr1 | expr4', # '+', '+ "expressions not in and quotes | (are) not - parsed " - expr2 | expr4', # '+', '- expr3 - expr2 | expr4']) def test_sqpp_expr1_escaped_quoted_expr2_and_paren_expr3_not_expr4_WORDS(self): """SearchQueryParenthesisedParser - expr1 \\" expr2 foo(expr3) not expr4 \\" and (expr5)""" self.assertEqual(self.parser.parse_query('expr1 \\" expr2 foo(expr3) not expr4 \\" and (expr5)'), ['+', 'expr1', '+', '\\"', '+', 'expr2', '+', 'foo(expr3)', '-', 'expr4', '+', '\\"', '+', 'expr5']) def test_sqpp_paren_expr1_and_expr2_or_expr3_WORDS(self): """SearchQueryParenthesisedParser - (expr1 and expr2) or expr3""" self.assertEqual(self.parser.parse_query('(expr1 and expr2) or expr3'), ['+', 'expr1 + expr2', '|', 'expr3']) #['+', '+ expr1 | expr3', '+', '+ expr2 | expr3']) def test_sqpp_paren_expr1_and_expr2_or_expr3_WORDS_equiv(self): """SearchQueryParenthesisedParser - (expr1 and expr2) or expr3 == (expr1 + expr2) | expr3""" self.assertEqual(self.parser.parse_query('(expr1 and expr2) or expr3'), self.parser.parse_query('(expr1 + expr2) | expr3')) def test_sqpp_paren_expr1_and_expr2_or_expr3_WORDS_equiv_SYMBOLS(self): """SearchQueryParenthesisedParser - (expr1 and expr2) or expr3 == (expr1 + expr2) or expr3""" self.assertEqual(self.parser.parse_query('(expr1 and expr2) or expr3'), 
self.parser.parse_query('(expr1 + expr2) or expr3')) def test_sqpp_double_quotes(self): """SearchQueryParenthesisedParser - Test double quotes""" self.assertEqual(self.parser.parse_query( '(expr1) - expr2 | "expressions - in + quotes | (are) not - parsed " - (expr3) | expr4'), ['+', 'expr1', '-', 'expr2', '|', '"expressions - in + quotes | (are) not - parsed "', '-', 'expr3', '|', 'expr4']) #['+', '+ "expressions - in + quotes | (are) not - parsed " | expr1 | expr4', # '+', '- expr3 | expr1 | expr4', # '+', '+ "expressions - in + quotes | (are) not - parsed " - expr2 | expr4', # '+', '- expr3 - expr2 | expr4']) def test_sqpp_single_quotes(self): """SearchQueryParenthesisedParser - Test single quotes""" self.assertEqual(self.parser.parse_query("(expr1) - expr2 | 'expressions - in + quotes | (are) not - parsed ' - (expr3) | expr4"), ['+', 'expr1', '-', 'expr2', '|', "'expressions - in + quotes | (are) not - parsed '", '-', 'expr3', '|', 'expr4']) #['+', '+ \'expressions - in + quotes | (are) not - parsed \' | expr1 | expr4', # '+', '- expr3 | expr1 | expr4', # '+', '+ \'expressions - in + quotes | (are) not - parsed \' - expr2 | expr4', # '+', '- expr3 - expr2 | expr4']) def test_sqpp_escape_single_quotes(self): """SearchQueryParenthesisedParser - Test escaping single quotes""" self.assertEqual(self.parser.parse_query("expr1 \\' expr2 +(expr3) -expr4 \\' + (expr5)"), ['+', 'expr1', '+', "\\'", '+', 'expr2', '+', 'expr3', '-', 'expr4', '+', "\\'", '+', 'expr5']) def test_sqpp_escape_double_quotes(self): """SearchQueryParenthesisedParser - Test escaping double quotes""" self.assertEqual(self.parser.parse_query('expr1 \\" expr2 +(expr3) -expr4 \\" + (expr5)'), ['+', 'expr1', '+', '\\"', '+', 'expr2', '+', 'expr3', '-', 'expr4', '+', '\\"', '+', 'expr5']) def test_sqpp_beginning_double_quotes(self): """SearchQueryParenthesisedParser - Test parsing double quotes at beginning""" self.assertEqual(self.parser.parse_query('"expr1" - (expr2)'), ['+', '"expr1"', '-', 'expr2']) def test_sqpp_beginning_double_quotes_negated(self): """SearchQueryParenthesisedParser - Test parsing negated double quotes at beginning""" self.assertEqual(self.parser.parse_query('-"expr1" - (expr2)'), ['-', '"expr1"', '-', 'expr2']) def test_sqpp_long_or_chain(self): """SearchQueryParenthesisedParser - Test long or chains being parsed flat""" self.assertEqual(self.parser.parse_query('p0 or p1 or p2 or p3 or p4'), ['+', 'p0', '|', 'p1', '|', 'p2', '|', 'p3', '|', 'p4']) def test_sqpp_not_after_recursion(self): """SearchQueryParenthesisedParser - Test operations after recursive calls""" self.assertEqual(self.parser.parse_query('(p0 or p1) not p2'), ['+', 'p0 | p1', '-', 'p2']) #['+', '+ p0 | p1', '-', 'p2']) def test_sqpp_oddly_capped_operators(self): """SearchQueryParenthesisedParser - Test conjunctions in any case""" self.assertEqual(self.parser.parse_query('foo oR bar'), ['+', 'foo', '|', 'bar']) def test_space_before_last_paren(self): """SearchQueryParenthesisedParser - Test (ellis )""" self.assertEqual(self.parser.parse_query('(ellis )'), ['+', 'ellis']) def test_sqpp_nested_U1_or_SL2(self): """SearchQueryParenthesisedParser - Test (U(1) or SL(2,Z))""" self.assertEqual(self.parser.parse_query('(U(1) or SL(2,Z))'), ['+', 'u(1) | sl(2,z)']) def test_sqpp_alternation_of_quote_marks_double(self): """SearchQueryParenthesisedParser - Test refersto:(author:"s parke" or author:ellis)""" self.assertEqual(self.parser.parse_query('refersto:(author:"s parke" or author:ellis)'), ['+', 'refersto:\'author:"s parke" | author:ellis\'']) def 
test_sqpp_alternation_of_quote_marks_single(self): """SearchQueryParenthesisedParser - Test refersto:(author:'s parke' or author:ellis)""" self.assertEqual(self.parser.parse_query('refersto:(author:\'s parke\' or author:ellis)'), ['+', 'refersto:"author:\'s parke\' | author:ellis"']) def test_sqpp_alternation_of_quote_marks(self): """SearchQueryParenthesisedParser - Test refersto:(author:"s parke")""" self.assertEqual(self.parser.parse_query('refersto:(author:"s parke")'), ['+', 'refersto:author:"s parke"']) def test_sqpp_distributed_ands_equivalent(self): """SearchQueryParenthesisedParser - ellis and (kaluza-klein or r-parity) == ellis and (r-parity or kaluza-klein)""" self.assertEqual(sorted(perform_request_search(p='ellis and (kaluza-klein or r-parity)')), sorted(perform_request_search(p='ellis and (r-parity or kaluza-klein)'))) def test_sqpp_e_plus_e_minus(self): """SearchQueryParenthesisedParser - e(+)e(-)""" self.assertEqual(self.parser.parse_query('e(+)e(-)'), ['+', 'e(+)e(-)']) def test_sqpp_fe_2_plus(self): """SearchQueryParenthesisedParser - Fe(2+)""" self.assertEqual(self.parser.parse_query('Fe(2+)'), ['+', 'fe(2+)']) def test_sqpp_giant_evil_title_string(self): """SearchQueryParenthesisedParser - Measurements of CP-conserving trilinear gauge boson couplings WWV (V gamma, Z) in e(+)e(-) collisions at LEP2""" self.assertEqual(self.parser.parse_query('Measurements of CP-conserving trilinear gauge boson couplings WWV (V gamma, Z) in e(+)e(-) collisions at LEP2'), ['+', 'measurements', '+', 'of', '+', 'cp-conserving', '+', 'trilinear', '+', 'gauge', \ '+', 'boson', '+', 'couplings', '+', 'wwv', '+', 'v + gamma, + z', \ '+', 'in', '+', 'e(+)e(-)', '+', 'collisions', '+', 'at', '+', 'lep2']) def test_sqpp_second_order_operator_operates_on_parentheses(self): """SearchQueryParenthesisedParser - refersto:(author:ellis or author:hawking)""" self.assertEqual(self.parser.parse_query('refersto:(author:ellis or author:hawking)'), ['+', 'refersto:"author:ellis | author:hawking"']) class TestSpiresToInvenioSyntaxConverter(unittest.TestCase): """Test SPIRES query parsing and translation to Invenio syntax.""" def _compare_searches(self, invenio_syntax, spires_syntax): """Determine if two queries parse to the same search command. For comparison of actual search results (regression testing), see the tests in the Inspire module. 
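
        Typical call (illustrative):

            self._compare_searches('author:ellis and title:shapes',
                                   'find a ellis and t shapes')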
""" parser = search_engine_query_parser.SearchQueryParenthesisedParser() converter = search_engine_query_parser.SpiresToInvenioSyntaxConverter() parsed_query = parser.parse_query(converter.convert_query(spires_syntax)) #parse_query removes any parens that convert_query added, but then #we have to rejoin the list it returns and create basic searches result_obtained = create_basic_search_units( None, ' '.join(parsed_query).replace('+ ',''), '', None ) # incase the desired result has parens parsed_wanted = parser.parse_query(invenio_syntax) result_wanted = create_basic_search_units( None, ' '.join(parsed_wanted).replace('+ ',''), '', None) assert result_obtained == result_wanted, \ """SPIRES parsed as %s instead of %s""" % \ (repr(result_obtained), repr(result_wanted)) return if CFG_WEBSEARCH_SPIRES_SYNTAX > 0: def test_operators(self): """SPIRES search syntax - find a ellis and t shapes""" invenio_search = "author:ellis and title:shapes" spires_search = "find a ellis and t shapes" self._compare_searches(invenio_search, spires_search) def test_nots(self): """SPIRES search syntax - find a ellis and not t hadronic and not t collisions""" invenio_search = "author:ellis and not title:hadronic and not title:collisions" spires_search = "find a ellis and not t hadronic and not t collisions" self._compare_searches(invenio_search, spires_search) def test_author_simplest(self): """SPIRES search syntax - find a ellis""" invenio_search = 'author:ellis' spires_search = 'find a ellis' self._compare_searches(invenio_search, spires_search) def test_author_simple(self): """SPIRES search syntax - find a ellis, j""" invenio_search = 'author:"ellis, j*"' spires_search = 'find a ellis, j' self._compare_searches(invenio_search, spires_search) def test_exactauthor_simple(self): """SPIRES search syntax - find ea ellis, j""" invenio_search = 'exactauthor:"ellis, j"' spires_search = 'find ea ellis, j' self._compare_searches(invenio_search, spires_search) def test_author_reverse(self): """SPIRES search syntax - find a j ellis""" invenio_search = 'author:"ellis, j*"' spires_search = 'find a j ellis' self._compare_searches(invenio_search, spires_search) def test_author_initials(self): """SPIRES search syntax - find a a m polyakov""" inv_search = 'author:"polyakov, a* m*"' spi_search = 'find a a m polyakov' self._compare_searches(inv_search, spi_search) def test_author_many_initials(self): """SPIRES search syntax - find a p d q bach""" inv_search = 'author:"bach, p* d* q*"' spi_search = 'find a p d q bach' self._compare_searches(inv_search, spi_search) def test_author_many_lastnames(self): """SPIRES search syntax - find a alvarez gaume, j r r""" inv_search = 'author:"alvarez gaume, j* r* r*"' spi_search = 'find a alvarez gaume, j r r' self._compare_searches(inv_search, spi_search) def test_author_full_initial(self): """SPIRES search syntax - find a klebanov, ig.r.""" inv_search = 'author:"klebanov, ig* r*" or exactauthor:"klebanov, i r"' spi_search = "find a klebanov, ig.r." 
self._compare_searches(inv_search, spi_search) def test_author_full_first(self): """SPIRES search syntax - find a ellis, john""" invenio_search = 'author:"ellis, john*" or exactauthor:"ellis, j *" or exactauthor:"ellis, j" or exactauthor:"ellis, jo" or exactauthor:"ellis, joh" or author:"ellis, john, *"' spires_search = 'find a ellis, john' self._compare_searches(invenio_search, spires_search) def test_combine_multiple(self): """SPIRES search syntax - find a gattringer, c and k symmetry chiral and not title chiral""" inv_search = 'author:"gattringer, c*" keyword:chiral keyword:symmetry -title:chiral' spi_search = "find a c gattringer and k chiral symmetry and not title chiral" self._compare_searches(inv_search, spi_search) def test_combine_multiple_or(self): """SPIRES search syntax - find a j ellis and (t report or k \"cross section\")""" inv_search = 'author:"ellis, j*" and (title:report or keyword:"cross section")' spi_search = 'find a j ellis and (t report or k "cross section")' self._compare_searches(inv_search, spi_search) def test_find_first_author(self): """SPIRES search syntax - find fa ellis""" inv_search = 'firstauthor:ellis' spi_search = 'find fa ellis' self._compare_searches(inv_search, spi_search) def test_find_first_author_initial(self): """SPIRES search syntax - find fa j ellis""" inv_search = 'firstauthor:"ellis, j*"' spi_search = 'find fa j ellis' self._compare_searches(inv_search, spi_search) def test_first_author_full_initial(self): """SPIRES search syntax - find fa klebanov, ig.r.""" inv_search = 'firstauthor:"klebanov, ig* r*" or exactfirstauthor:"klebanov, i r"' spi_search = "find fa klebanov, ig.r." self._compare_searches(inv_search, spi_search) def test_citedby_author(self): """SPIRES search syntax - find citedby author doggy""" inv_search = 'citedby:author:doggy' spi_search = 'find citedby author doggy' self._compare_searches(inv_search, spi_search) def test_refersto_author(self): """SPIRES search syntax - find refersto author kitty""" inv_search = 'refersto:author:kitty' spi_search = 'find refersto author kitty' self._compare_searches(inv_search, spi_search) def test_refersto_author_multi_name(self): """SPIRES search syntax - find a ellis and refersto author \"parke, sj\"""" inv_search = 'author:ellis refersto:author:"parke, s. j."' spi_search = 'find a ellis and refersto author "parke, s. 
j."' self._compare_searches(inv_search, spi_search) def test_refersto_author_multi_name_no_quotes(self): """SPIRES search syntax - find a ellis and refersto author parke, sj""" inv_search = 'author:ellis refersto:(author:"parke, sj*" or exactauthor:"parke, s *" or exactauthor:"parke, s" or author:"parke, sj, *")' spi_search = "find a ellis and refersto author parke, sj" self._compare_searches(inv_search, spi_search) def test_refersto_multi_word_no_quotes_no_index(self): """SPIRES search syntax - find refersto s parke""" inv_search = 'refersto:"s parke"' spi_search = 'find refersto s parke' self._compare_searches(inv_search, spi_search) def test_citedby_refersto_author(self): """SPIRES search syntax - find citedby refersto author penguin""" inv_search = 'refersto:citedby:author:penguin' spi_search = 'find refersto citedby author penguin' self._compare_searches(inv_search, spi_search) def test_irn_processing(self): """SPIRES search syntax - find irn 1360337 == find irn SPIRES-1360337""" # Added for trac-130 with_spires = "fin irn SPIRES-1360337" with_result = perform_request_search(p=with_spires) without_spires = "fin irn 1360337" without_result = perform_request_search(p=without_spires) # We don't care if results are [], as long as they're the same # Uncovered corner case: parsing could be broken and also happen to # return [] twice. Unlikely though. self.assertEqual(with_result, without_result) def test_topcite(self): """SPIRES search syntax - find topcite 50+""" inv_search = "cited:50->999999999" spi_search = "find topcite 50+" self._compare_searches(inv_search, spi_search) def test_topcit(self): """SPIRES search syntax - find topcit 50+""" inv_search = "cited:50->999999999" spi_search = "find topcit 50+" self._compare_searches(inv_search, spi_search) def test_caption(self): """SPIRES search syntax - find caption muon""" inv_search = "caption:muon" spi_search = "find caption muon" self._compare_searches(inv_search, spi_search) def test_caption_multi_word(self): """SPIRES search syntax - find caption quark mass""" inv_search = "caption:quark and caption:mass" spi_search = "find caption quark mass" self._compare_searches(inv_search, spi_search) def test_quotes(self): """SPIRES search syntax - find t 'compton scattering' and a mele""" inv_search = "title:'compton scattering' and author:mele" spi_search = "find t 'compton scattering' and a mele" self._compare_searches(inv_search, spi_search) def test_equals_sign(self): """SPIRES search syntax - find a beacom and date = 2000""" inv_search = "author:beacom year:2000" spi_search = "find a beacom and date = 2000" self._compare_searches(inv_search, spi_search) def test_type_code(self): """SPIRES search syntax - find tc/ps/scl review""" inv_search = "collection:review" spi_search = "find tc review" self._compare_searches(inv_search, spi_search) inv_search = "collection:review" spi_search = "find ps review" self._compare_searches(inv_search, spi_search) inv_search = "collection:review" spi_search = "find scl review" self._compare_searches(inv_search, spi_search) def test_field_code(self): """SPIRES search syntax - f f p""" inv_search = "subject:p" spi_search = "f f p" self._compare_searches(inv_search, spi_search) def test_coden(self): """SPIRES search syntax - find coden aphys""" inv_search = "journal:aphys" spi_search = "find coden aphys" self._compare_searches(inv_search, spi_search) def test_job_title(self): """SPIRES search syntax - find job engineer not position programmer""" inv_search = 'title:engineer not title:programmer' spi_search = 
'find job engineer not position programmer' self._compare_searches(inv_search, spi_search) def test_job_rank(self): """SPIRES search syntax - find rank Postdoc""" inv_search = 'rank:Postdoc' spi_search = 'find rank Postdoc' self._compare_searches(inv_search, spi_search) def test_job_region(self): """SPIRES search syntax - find region EU not continent Europe""" inv_search = 'region:EU not region:Europe' spi_search = 'find region EU not continent Europe' self._compare_searches(inv_search, spi_search) def test_fin_to_find_trans(self): """SPIRES search syntax - fin a ellis, j == find a ellis, j""" fin_search = "fin a ellis, j" fin_result = perform_request_search(p=fin_search) find_search = "find a ellis, j" find_result = perform_request_search(p=find_search) # We don't care if results are [], as long as they're the same # Uncovered corner case: parsing could be broken and also happen to # return [] twice. Unlikely though. self.assertEqual(fin_result, find_result) def test_distribution_of_notted_search_terms(self): """SPIRES search syntax - find t this and not that ->title:this and not title:that""" spi_search = "find t this and not that" inv_search = "title:this and not title:that" self._compare_searches(inv_search, spi_search) def test_distribution_without_spacing(self): """SPIRES search syntax - find aff SLAC and Stanford ->affiliation:SLAC and affiliation:Stanford""" # motivated by trac-187 spi_search = "find aff SLAC and Stanford" inv_search = "affiliation:SLAC and affiliation:Stanford" self._compare_searches(inv_search, spi_search) def test_distribution_with_phrases(self): """SPIRES search syntax - find aff Penn State U -> affiliation:"Penn State U""" # motivated by trac-517 spi_search = "find aff Penn State U" inv_search = "affiliation:\"Penn State U\"" self._compare_searches(inv_search, spi_search) def test_distribution_with_many_clauses(self): """SPIRES search syntax - find a mele and brooks and holtkamp and o'connell""" spi_search = "find a mele and brooks and holtkamp and o'connell" inv_search = "author:mele author:brooks author:holtkamp author:o'connell" self._compare_searches(inv_search, spi_search) def test_keyword_as_kw(self): """SPIRES search syntax - find kw something ->keyword:something""" spi_search = "find kw meson" inv_search = "keyword:meson" self._compare_searches(inv_search, spi_search) def test_recid(self): """SPIRES search syntax - find recid 11111""" spi_search = 'find recid 111111' inv_search = 'recid:111111' self._compare_searches(inv_search, spi_search) def test_desy_keyword_translation(self): """SPIRES search syntax - find dk "B --> pi pi" """ spi_search = "find dk \"B --> pi pi\"" inv_search = "695__a:\"B --> pi pi\"" self._compare_searches(inv_search, spi_search) def test_journal_section_joining(self): """SPIRES search syntax - journal Phys.Lett, 0903, 024 -> journal:Phys.Lett,0903,024""" spi_search = "find j Phys.Lett, 0903, 024" inv_search = "journal:Phys.Lett,0903,024" self._compare_searches(inv_search, spi_search) def test_journal_search_with_colon(self): """SPIRES search syntax - find j physics 1:195 -> journal:physics,1,195""" spi_search = "find j physics 1:195" inv_search = "journal:physics,1,195" self._compare_searches(inv_search, spi_search) def test_journal_non_triple_syntax(self): """SPIRES search syntax - find j physics jcap""" spi_search = "find j physics jcap" inv_search = "journal:physics and journal:jcap" self._compare_searches(inv_search, spi_search) def test_journal_triple_with_many_spaces(self): """SPIRES search syntax - find j physics 0903 
024""" spi_search = 'find j physics 0903 024' inv_search = 'journal:physics,0903,024' self._compare_searches(inv_search, spi_search) def test_distribution_of_search_terms(self): """SPIRES search syntax - find t this and that ->title:this and title:that""" spi_search = "find t this and that" inv_search = "title:this and title:that" self._compare_searches(inv_search, spi_search) def test_syntax_converter_expand_search_patterns_alone(self): """SPIRES search syntax - simplest expansion""" spi_search = "find t bob sam" inv_search = "title:bob and title:sam" self._compare_searches(inv_search, spi_search) def test_syntax_converter_expand_fulltext(self): """SPIRES search syntax - fulltext support""" spi_search = "find ft The holographic RG is based on" inv_search = "fulltext:The and fulltext:holographic and fulltext:RG and fulltext:is and fulltext:based and fulltext:on" self._compare_searches(inv_search, spi_search) def test_syntax_converter_expand_fulltext_within_larger(self): """SPIRES search syntax - fulltext subsearch support""" spi_search = "find au taylor and ft The holographic RG is based on and t brane" inv_search = "author:taylor fulltext:The and fulltext:holographic and fulltext:RG and fulltext:is and fulltext:based and fulltext:on title:brane" self._compare_searches(inv_search, spi_search) def test_syntax_converter_expand_search_patterns_conjoined(self): """SPIRES search syntax - simplest distribution""" spi_search = "find t bob and sam" inv_search = "title:bob and title:sam" self._compare_searches(inv_search, spi_search) def test_syntax_converter_expand_search_patterns_multiple(self): """SPIRES search syntax - expansion (no distribution)""" spi_search = "find t bob sam and k couch" inv_search = "title:bob and title:sam and keyword:couch" self._compare_searches(inv_search, spi_search) def test_syntax_converter_expand_search_patterns_multiple_conjoined(self): """SPIRES search syntax - distribution and expansion""" spi_search = "find t bob sam and couch" inv_search = "title:bob and title:sam and title:couch" self._compare_searches(inv_search, spi_search) + def test_date_invalid(self): + """SPIRES search syntax - searching an invalid date""" + spi_search = "find date foo" + inv_search = "year:foo" + self._compare_searches(inv_search, spi_search) + def test_date_by_yr(self): """SPIRES search syntax - searching by date year""" spi_search = "find date 2002" inv_search = "year:2002" self._compare_searches(inv_search, spi_search) def test_date_by_lt_yr(self): """SPIRES search syntax - searching by date < year""" spi_search = "find date < 2002" inv_search = 'year:0->2002' self._compare_searches(inv_search, spi_search) def test_date_by_gt_yr(self): """SPIRES search syntax - searching by date > year""" spi_search = "find date > 1980" inv_search = 'year:1980->9999' self._compare_searches(inv_search, spi_search) def test_date_by_yr_mo(self): """SPIRES search syntax - searching by date 1976-04""" spi_search = "find date 1976-04" inv_search = 'year:1976-04' self._compare_searches(inv_search, spi_search) def test_date_by_yr_mo_day_wholemonth_and_suffix(self): """SPIRES search syntax - searching by date 1976-04-01 and t dog""" spi_search = "find date 1976-04-01 and t dog" - inv_search = 'year:1976-04 and title:dog' + inv_search = 'year:1976-04-01 and title:dog' self._compare_searches(inv_search, spi_search) def test_date_by_yr_mo_day_and_suffix(self): """SPIRES search syntax - searching by date 1976-04-05 and t dog""" spi_search = "find date 1976-04-05 and t dog" inv_search = 'year:1976-04-05 and 
title:dog' self._compare_searches(inv_search, spi_search) def test_date_by_eq_yr_mo(self): """SPIRES search syntax - searching by date 1976-04""" spi_search = "find date 1976-04" inv_search = 'year:1976-04' self._compare_searches(inv_search, spi_search) def test_date_by_lt_yr_mo(self): """SPIRES search syntax - searching by date < 1978-10-21""" spi_search = "find date < 1978-10-21" inv_search = 'year:0->1978-10-21' self._compare_searches(inv_search, spi_search) def test_date_by_gt_yr_mo(self): """SPIRES search syntax - searching by date > 1978-10-21""" spi_search = "find date > 1978-10-21" inv_search = 'year:1978-10-21->9999' self._compare_searches(inv_search, spi_search) + if DATEUTIL_AVAILABLE: + def test_date_2_digits_year_month_day(self): + """SPIRES search syntax - searching by date > 78-10-21""" + spi_search = "find date 78-10-21" + inv_search = 'year:1978-10-21' + self._compare_searches(inv_search, spi_search) + + if DATEUTIL_AVAILABLE: + def test_date_2_digits_year(self): + """SPIRES search syntax - searching by date 78""" + spi_search = "find date 78" + inv_search = 'year:1978' + self._compare_searches(inv_search, spi_search) + + if DATEUTIL_AVAILABLE: + def test_date_2_digits_year_future(self): + """SPIRES search syntax - searching by date 2 years in the future""" + d = datetime.datetime.today() + datetime.timedelta(days=730) + spi_search = "find date %s" % d.strftime("%y") + inv_search = 'year:%s' % d.strftime("%Y") + self._compare_searches(inv_search, spi_search) + + if DATEUTIL_AVAILABLE: + def test_date_2_digits_month_year(self): + """SPIRES search syntax - searching by date feb 12""" + # This should give us "feb 12" with us locale + d = datetime.datetime(year=2012, month=2, day=1) + date_str = d.strftime('%b %y') + spi_search = "find date %s" % date_str + inv_search = 'year:2012-02' + self._compare_searches(inv_search, spi_search) + def test_spires_syntax_trailing_colon(self): """SPIRES search syntax - test for blowup with trailing colon""" spi_search = "find a watanabe:" invenio_search = "author:watanabe:" self._compare_searches(invenio_search, spi_search) if DATEUTIL_AVAILABLE: def test_date_by_lt_d_MO_yr(self): """SPIRES search syntax - searching by date < 23 Sep 2010: will only work with dateutil installed""" spi_search = "find date < 23 Sep 2010" inv_search = 'year:0->2010-09-23' self._compare_searches(inv_search, spi_search) def test_date_by_gt_d_MO_yr(self): """SPIRES search syntax - searching by date > 12 Jun 1960: will only work with dateutil installed""" spi_search = "find date > 12 Jun 1960" inv_search = 'year:1960-06-12->9999' self._compare_searches(inv_search, spi_search) def test_date_accept_today(self): """SPIRES search syntax - searching by today""" spi_search = "find date today" inv_search = "year:" + datetime.datetime.strftime(datetime.datetime.today(), '%Y-%m-%d') self._compare_searches(inv_search, spi_search) def test_date_accept_yesterday(self): """SPIRES search syntax - searching by yesterday""" import dateutil.relativedelta spi_search = "find date yesterday" inv_search = "year:" + datetime.datetime.strftime(datetime.datetime.today()+dateutil.relativedelta.relativedelta(days=-1), '%Y-%m-%d') self._compare_searches(inv_search, spi_search) def test_date_accept_this_month(self): """SPIRES search syntax - searching by this month""" spi_search = "find date this month" inv_search = "year:" + datetime.datetime.strftime(datetime.datetime.today(), '%Y-%m') self._compare_searches(inv_search, spi_search) def test_date_accept_last_month(self): """SPIRES search 
syntax - searching by last month""" spi_search = "find date last month" inv_search = "year:" + datetime.datetime.strftime(datetime.datetime.today()\ +dateutil.relativedelta.relativedelta(months=-1), '%Y-%m') self._compare_searches(inv_search, spi_search) def test_date_accept_this_week(self): """SPIRES search syntax - searching by this week""" spi_search = "find date this week" - inv_search = "year:" + datetime.datetime.strftime(datetime.datetime.today()\ - +dateutil.relativedelta.relativedelta(days=-(datetime.datetime.today().isoweekday()%7)), '%Y-%m-%d') + begin = datetime.datetime.today() + days_to_remove = datetime.datetime.today().isoweekday() % 7 + begin += du_delta(days=-days_to_remove) + begin_str = datetime.datetime.strftime(begin, '%Y-%m-%d') + # Only 6 days cause the last day is included in the search + end = datetime.datetime.today() + end_str = datetime.datetime.strftime(end, '%Y-%m-%d') + inv_search = "year:%s->%s" % (begin_str, end_str) self._compare_searches(inv_search, spi_search) def test_date_accept_last_week(self): """SPIRES search syntax - searching by last week""" spi_search = "find date last week" - inv_search = "year:" + datetime.datetime.strftime(datetime.datetime.today()\ - +dateutil.relativedelta.relativedelta(days=-(7+(datetime.datetime.today().isoweekday()%7))), '%Y-%m-%d') + begin = datetime.datetime.today() + days_to_remove = 7 + datetime.datetime.today().isoweekday() % 7 + begin += du_delta(days=-days_to_remove) + begin_str = datetime.datetime.strftime(begin, '%Y-%m-%d') + # Only 6 days cause the last day is included in the search + end = begin + du_delta(days=6) + end_str = datetime.datetime.strftime(end, '%Y-%m-%d') + inv_search = "year:%s->%s" % (begin_str, end_str) self._compare_searches(inv_search, spi_search) def test_date_accept_date_minus_days(self): """SPIRES search syntax - searching by 2011-01-03 - 2""" spi_search = "find date 2011-01-03 - 2" - inv_search = "year:2011-01" + inv_search = "year:2011-01-01" self._compare_searches(inv_search, spi_search) def test_date_accept_date_minus_days_with_month_wrap(self): """SPIRES search syntax - searching by 2011-03-01 - 1""" spi_search = "find date 2011-03-01 - 1" inv_search = "year:2011-02-28" self._compare_searches(inv_search, spi_search) def test_date_accept_date_minus_days_with_year_wrap(self): """SPIRES search syntax - searching by 2011-01-01 - 1""" spi_search = "find date 2011-01-01 - 1" inv_search = "year:2010-12-31" self._compare_searches(inv_search, spi_search) def test_date_accept_date_minus_days_with_leapyear_february(self): """SPIRES search syntax - searching by 2008-03-01 - 1""" spi_search = "find date 2008-03-01 - 1" inv_search = "year:2008-02-29" self._compare_searches(inv_search, spi_search) def test_date_accept_date_minus_many_days(self): """SPIRES search syntax - searching by 2011-02-24 - 946""" spi_search = "find date 2011-02-24 - 946" inv_search = "year:2008-07-23" self._compare_searches(inv_search, spi_search) def test_date_accept_date_plus_days(self): """SPIRES search syntax - searching by 2011-01-03 + 2""" spi_search = "find date 2011-01-01 + 2" inv_search = "year:2011-01-03" self._compare_searches(inv_search, spi_search) def test_date_accept_plus_days_with_month_wrap(self): """SPIRES search syntax - searching by 2011-03-31 + 2""" spi_search = "find date 2011-03-31 + 2" inv_search = "year:2011-04-02" self._compare_searches(inv_search, spi_search) def test_date_accept_date_plus_days_with_year_wrap(self): """SPIRES search syntax - searching by 2011-12-31 + 1""" spi_search = "find date 
2011-12-31 + 1" - inv_search = "year:2012-01" + inv_search = "year:2012-01-01" self._compare_searches(inv_search, spi_search) def test_date_accept_date_plus_days_with_leapyear_february(self): """SPIRES search syntax - searching by 2008-02-29 + 2""" spi_search = "find date 2008-02-28 + 2" - inv_search = "year:2008-03" + inv_search = "year:2008-03-01" self._compare_searches(inv_search, spi_search) def test_date_accept_date_plus_many_days(self): """SPIRES search syntax - searching by 2011-02-24 + 666""" spi_search = "find date 2011-02-24 + 666" inv_search = "year:2012-12-21" self._compare_searches(inv_search, spi_search) def test_spires_syntax_detected_f(self): """SPIRES search syntax - test detection f t p""" # trac #261 converter = search_engine_query_parser.SpiresToInvenioSyntaxConverter() spi_search = converter.is_applicable("f t p") self.assertEqual(spi_search, True) def test_spires_syntax_detected_fin(self): """SPIRES search syntax - test detection fin t p""" # trac #261 converter = search_engine_query_parser.SpiresToInvenioSyntaxConverter() spi_search = converter.is_applicable("fin t p") self.assertEqual(spi_search, True) def test_spires_keyword_distribution_before_conjunctions(self): """SPIRES search syntax - test find journal phys.lett. 0903 024""" spi_search = 'find journal phys.lett. 0903 024' inv_search = '(journal:phys.lett.,0903,024)' self._compare_searches(inv_search, spi_search) def test_spires_keyword_distribution_with_parens(self): """SPIRES search syntax - test find cn d0 and (a abachi or abbott or abazov)""" spi_search = "find cn d0 and (a abachi or abbott or abazov)" inv_search = "collaboration:d0 and (author:abachi or author:abbott or author:abazov)" self._compare_searches(inv_search, spi_search) def test_super_short_author_name(self): """SPIRES search syntax - test fin a er and cn cms""" spi_search = "fin a er and cn cms" inv_search = "author:er collaboration:cms" self._compare_searches(inv_search, spi_search) def test_simple_syntax_mixing(self): """SPIRES and invenio search syntax - find a ellis and citedby:hawking""" combo_search = "find a ellis and citedby:hawking" inv_search = "author:ellis citedby:hawking" self._compare_searches(inv_search, combo_search) def test_author_first_syntax_mixing(self): """SPIRES and invenio search syntax - find a dixon, l.j. cited:10->52""" combo_search = 'find a dixon, l.j. 
cited:10->52' inv_search = 'author:"dixon, l* j*" cited:10->52' self._compare_searches(inv_search, combo_search) def test_minus_boolean_syntax_mixing(self): """SPIRES and invenio search syntax - find a ellis -title:muon""" combo_search = 'find a ellis -title:muon' inv_search = 'author:ellis -title:muon' self._compare_searches(inv_search, combo_search) def test_plus_boolean_syntax_mixing(self): """SPIRES and invenio search syntax - find a ellis +title:muon""" combo_search = 'find a ellis +title:muon' inv_search = 'author:ellis title:muon' self._compare_searches(inv_search, combo_search) def test_second_level_syntax_mixing(self): """SPIRES and invenio search syntax - find a ellis refersto:author:hawking""" combo_search = 'find a ellis refersto:author:hawking' inv_search = 'author:ellis refersto:author:hawking' self._compare_searches(inv_search, combo_search) if CFG_WEBSEARCH_SPIRES_SYNTAX > 1: def test_absorbs_naked_a_search(self): """SPIRES search syntax - a ellis""" invenio_search = "author:ellis" naked_search = "a ellis" self._compare_searches(invenio_search, naked_search) def test_absorbs_naked_author_search(self): """SPIRES search syntax - author ellis""" invenio_search = "author:ellis" spi_search = "author ellis" self._compare_searches(invenio_search, spi_search) def test_spires_syntax_detected_naked_a(self): """SPIRES search syntax - test detection a ellis""" converter = search_engine_query_parser.SpiresToInvenioSyntaxConverter() spi_search = converter.is_applicable("a ellis") self.assertEqual(spi_search, True) def test_spires_syntax_detected_naked_author(self): """SPIRES search syntax - test detection author ellis""" converter = search_engine_query_parser.SpiresToInvenioSyntaxConverter() spi_search = converter.is_applicable("author ellis") self.assertEqual(spi_search, True) def test_spires_syntax_detected_naked_author_leading_spaces(self): """SPIRES search syntax - test detection author ellis""" converter = search_engine_query_parser.SpiresToInvenioSyntaxConverter() spi_search = converter.is_applicable(" author ellis") self.assertEqual(spi_search, True) def test_spires_syntax_detected_naked_title(self): """SPIRES search syntax - test detection t muon""" converter = search_engine_query_parser.SpiresToInvenioSyntaxConverter() spi_search = converter.is_applicable("t muon") self.assertEqual(spi_search, True) def test_spires_syntax_detected_second_keyword(self): """SPIRES search syntax - test detection author:ellis and t muon""" converter = search_engine_query_parser.SpiresToInvenioSyntaxConverter() spi_search = converter.is_applicable("author:ellis and t muon") self.assertEqual(spi_search, True) def test_spires_syntax_detected_invenio(self): """SPIRES search syntax - test detection Not SPIRES""" # trac #261 converter = search_engine_query_parser.SpiresToInvenioSyntaxConverter() inv_search = converter.is_applicable("t:p a:c") self.assertEqual(inv_search, False) def test_invenio_syntax_only_second_level(self): """invenio search syntax - citedby:reportnumber:hep-th/0205061""" inv_search = 'citedby:reportnumber:hep-th/0205061' self._compare_searches(inv_search, inv_search) def test_invenio_syntax_only_boolean(self): """invenio search syntax - author:ellis and not title:hadronic and not title:collisions""" inv_search = "author:ellis and not title:hadronic and not title:collisions" self._compare_searches(inv_search, inv_search) TEST_SUITE = make_test_suite(TestSearchQueryParenthesisedParser, TestSpiresToInvenioSyntaxConverter, TestParserUtilityFunctions) if __name__ == "__main__": 
run_test_suite(TEST_SUITE) #run_test_suite(make_test_suite(TestParserUtilityFunctions, TestSearchQueryParenthesisedParser)) # DEBUG diff --git a/modules/websearch/lib/websearch_regression_tests.py b/modules/websearch/lib/websearch_regression_tests.py index 4e603e870..660806c4a 100644 --- a/modules/websearch/lib/websearch_regression_tests.py +++ b/modules/websearch/lib/websearch_regression_tests.py @@ -1,2852 +1,2854 @@ # -*- coding: utf-8 -*- ## ## This file is part of Invenio. ## Copyright (C) 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013 CERN. ## ## Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. # pylint: disable=C0301 # pylint: disable=E1102 """WebSearch module regression tests.""" __revision__ = "$Id$" import unittest import re import urlparse, cgi import sys import cStringIO if sys.hexversion < 0x2040000: # pylint: disable=W0622 from sets import Set as set # pylint: enable=W0622 from mechanize import Browser, LinkNotFoundError from invenio.config import CFG_SITE_URL, CFG_SITE_NAME, CFG_SITE_LANG, \ CFG_SITE_RECORD, CFG_SITE_LANGS, \ CFG_SITE_SECURE_URL, CFG_WEBSEARCH_SPIRES_SYNTAX from invenio.testutils import make_test_suite, \ run_test_suite, \ nottest, \ make_url, make_surl, test_web_page_content, \ merge_error_messages from invenio.urlutils import same_urls_p from invenio.dbquery import run_sql from invenio.search_engine import perform_request_search, \ guess_primary_collection_of_a_record, guess_collection_of_a_record, \ collection_restricted_p, get_permitted_restricted_collections, \ search_pattern, search_unit, search_unit_in_bibrec, \ wash_colls, record_public_p from invenio import search_engine_summarizer from invenio.search_engine_utils import get_fieldvalues from invenio.intbitset import intbitset from invenio.search_engine import intersect_results_with_collrecs from invenio.bibrank_bridge_utils import get_external_word_similarity_ranker +from invenio.search_engine_query_parser_unit_tests import DATEUTIL_AVAILABLE if 'fr' in CFG_SITE_LANGS: lang_french_configured = True else: lang_french_configured = False def parse_url(url): parts = urlparse.urlparse(url) query = cgi.parse_qs(parts[4], True) return parts[2].split('/')[1:], query def string_combinations(str_list): """Returns all the possible combinations of the strings in the list. Example: for the list ['A','B','Cd'], it will return [['Cd', 'B', 'A'], ['B', 'A'], ['Cd', 'A'], ['A'], ['Cd', 'B'], ['B'], ['Cd'], []] It adds "B", "H", "F" and "S" values to the results so different combinations of them are also checked. 
""" out_list = [] for i in range(len(str_list) + 1): out_list += list(combinations(str_list, i)) for i in range(len(out_list)): out_list[i] = (list(out_list[i]) + { 0: lambda: ["B", "H", "S"], 1: lambda: ["B", "H", "F"], 2: lambda: ["B", "F", "S"], 3: lambda: ["B", "F"], 4: lambda: ["B", "S"], 5: lambda: ["B", "H"], 6: lambda: ["B"] }[i % 7]()) return out_list def combinations(iterable, r): """Return r length subsequences of elements from the input iterable.""" # combinations('ABCD', 2) --> AB AC AD BC BD CD # combinations(range(4), 3) --> 012 013 023 123 pool = tuple(iterable) n = len(pool) if r > n: return indices = range(r) yield tuple(pool[i] for i in indices) while True: for i in reversed(range(r)): if indices[i] != i + n - r: break else: return indices[i] += 1 for j in range(i+1, r): indices[j] = indices[j-1] + 1 yield tuple(pool[i] for i in indices) class WebSearchWebPagesAvailabilityTest(unittest.TestCase): """Check WebSearch web pages whether they are up or not.""" def test_search_interface_pages_availability(self): """websearch - availability of search interface pages""" baseurl = CFG_SITE_URL + '/' _exports = ['', 'collection/Poetry', 'collection/Poetry?as=1'] error_messages = [] for url in [baseurl + page for page in _exports]: error_messages.extend(test_web_page_content(url)) if error_messages: self.fail(merge_error_messages(error_messages)) return def test_search_results_pages_availability(self): """websearch - availability of search results pages""" baseurl = CFG_SITE_URL + '/search' _exports = ['', '?c=Poetry', '?p=ellis', '/cache', '/log'] error_messages = [] for url in [baseurl + page for page in _exports]: error_messages.extend(test_web_page_content(url)) if error_messages: self.fail(merge_error_messages(error_messages)) return def test_search_detailed_record_pages_availability(self): """websearch - availability of search detailed record pages""" baseurl = CFG_SITE_URL + '/'+ CFG_SITE_RECORD +'/' _exports = ['', '1', '1/', '1/files', '1/files/'] error_messages = [] for url in [baseurl + page for page in _exports]: error_messages.extend(test_web_page_content(url)) if error_messages: self.fail(merge_error_messages(error_messages)) return def test_browse_results_pages_availability(self): """websearch - availability of browse results pages""" baseurl = CFG_SITE_URL + '/search' _exports = ['?p=ellis&f=author&action_browse=Browse'] error_messages = [] for url in [baseurl + page for page in _exports]: error_messages.extend(test_web_page_content(url)) if error_messages: self.fail(merge_error_messages(error_messages)) return def test_help_page_availability(self): """websearch - availability of Help Central page""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/help', expected_text="Help Central")) if lang_french_configured: def test_help_page_availability_fr(self): """websearch - availability of Help Central page in french""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/help/?ln=fr', expected_text="Centre d'aide")) def test_search_tips_page_availability(self): """websearch - availability of Search Tips""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/help/search-tips', expected_text="Search Tips")) if lang_french_configured: def test_search_tips_page_availability_fr(self): """websearch - availability of Search Tips in french""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/help/search-tips?ln=fr', expected_text="Conseils de recherche")) def test_search_guide_page_availability(self): """websearch - availability of Search 
Guide""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/help/search-guide', expected_text="Search Guide")) if lang_french_configured: def test_search_guide_page_availability_fr(self): """websearch - availability of Search Guide in french""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/help/search-guide?ln=fr', expected_text="Guide de recherche")) class WebSearchTestLegacyURLs(unittest.TestCase): """ Check that the application still responds to legacy URLs for navigating, searching and browsing.""" def test_legacy_collections(self): """ websearch - collections handle legacy urls """ browser = Browser() def check(legacy, new, browser=browser): browser.open(legacy) got = browser.geturl() self.failUnless(same_urls_p(got, new), got) # Use the root URL unless we need more check(make_url('/', c=CFG_SITE_NAME), make_url('/', ln=CFG_SITE_LANG)) # Other collections are redirected in the /collection area check(make_url('/', c='Poetry'), make_url('/collection/Poetry', ln=CFG_SITE_LANG)) # Drop unnecessary arguments, like ln and as (when they are # the default value) args = {'as': 0} check(make_url('/', c='Poetry', **args), make_url('/collection/Poetry', ln=CFG_SITE_LANG)) # Otherwise, keep them args = {'as': 1, 'ln': CFG_SITE_LANG} check(make_url('/', c='Poetry', **args), make_url('/collection/Poetry', **args)) # Support the /index.py addressing too check(make_url('/index.py', c='Poetry'), make_url('/collection/Poetry', ln=CFG_SITE_LANG)) def test_legacy_search(self): """ websearch - search queries handle legacy urls """ browser = Browser() def check(legacy, new, browser=browser): browser.open(legacy) got = browser.geturl() self.failUnless(same_urls_p(got, new), got) # /search.py is redirected on /search # Note that `as' is a reserved word in Python 2.5 check(make_url('/search.py', p='nuclear', ln='en') + 'as=1', make_url('/search', p='nuclear', ln='en') + 'as=1') if lang_french_configured: def test_legacy_search_fr(self): """ websearch - search queries handle legacy urls """ browser = Browser() def check(legacy, new, browser=browser): browser.open(legacy) got = browser.geturl() self.failUnless(same_urls_p(got, new), got) # direct recid searches are redirected to /CFG_SITE_RECORD check(make_url('/search.py', recid=1, ln='fr'), make_url('/%s/1' % CFG_SITE_RECORD, ln='fr')) def test_legacy_search_help_link(self): """websearch - legacy Search Help page link""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/help/search/index.en.html', expected_text="Help Central")) if lang_french_configured: def test_legacy_search_tips_link(self): """websearch - legacy Search Tips page link""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/help/search/tips.fr.html', expected_text="Conseils de recherche")) def test_legacy_search_guide_link(self): """websearch - legacy Search Guide page link""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/help/search/guide.en.html', expected_text="Search Guide")) class WebSearchTestRecord(unittest.TestCase): """ Check the interface of the /CFG_SITE_RECORD results """ def test_format_links(self): """ websearch - check format links for records """ browser = Browser() # We open the record in all known HTML formats for hformat in ('hd', 'hx', 'hm'): browser.open(make_url('/%s/1' % CFG_SITE_RECORD, of=hformat)) if hformat == 'hd': # hd format should have a link to the following # formats for oformat in ('hx', 'hm', 'xm', 'xd'): target = make_url('/%s/1/export/%s?ln=en' % (CFG_SITE_RECORD, oformat)) try: 
browser.find_link(url=target) except LinkNotFoundError: self.fail('link %r should be in page' % target) else: # non-hd HTML formats should have a link back to # the main detailed record target = make_url('/%s/1' % CFG_SITE_RECORD) try: browser.find_link(url=target) except LinkNotFoundError: self.fail('link %r should be in page' % target) return def test_exported_formats(self): """ websearch - check formats exported through /CFG_SITE_RECORD/1/export/ URLs""" self.assertEqual([], test_web_page_content(make_url('/%s/1/export/hm' % CFG_SITE_RECORD), expected_text='245__ $$aALEPH experiment')) self.assertEqual([], test_web_page_content(make_url('/%s/1/export/hd' % CFG_SITE_RECORD), expected_text='ALEPH experiment')) self.assertEqual([], test_web_page_content(make_url('/%s/1/export/xm' % CFG_SITE_RECORD), expected_text='ALEPH experiment')) self.assertEqual([], test_web_page_content(make_url('/%s/1/export/xd' % CFG_SITE_RECORD), expected_text='ALEPH experiment')) self.assertEqual([], test_web_page_content(make_url('/%s/1/export/hs' % CFG_SITE_RECORD), expected_text='ALEPH experiment' % \ (CFG_SITE_RECORD, CFG_SITE_LANG))) self.assertEqual([], test_web_page_content(make_url('/%s/1/export/hx' % CFG_SITE_RECORD), expected_text='title = "ALEPH experiment')) self.assertEqual([], test_web_page_content(make_url('/%s/1/export/t?ot=245' % CFG_SITE_RECORD), expected_text='245__ $$aALEPH experiment')) self.assertNotEqual([], test_web_page_content(make_url('/%s/1/export/t?ot=245' % CFG_SITE_RECORD), expected_text='001__')) self.assertEqual([], test_web_page_content(make_url('/%s/1/export/h?ot=245' % CFG_SITE_RECORD), expected_text='245__ $$aALEPH experiment')) self.assertNotEqual([], test_web_page_content(make_url('/%s/1/export/h?ot=245' % CFG_SITE_RECORD), expected_text='001__')) return def test_plots_tab(self): """ websearch - test to ensure the plots tab is working """ self.assertEqual([], test_web_page_content(make_url('/%s/8/plots' % CFG_SITE_RECORD), expected_text='div id="clip"', unexpected_text='Abstract')) def test_meta_header(self): """ websearch - test that metadata embedded in header of hd relies on hdm format and Default_HTML_meta bft, but hook is in websearch to display the format """ self.assertEqual([], test_web_page_content(make_url('/record/1'), expected_text='')) return class WebSearchTestCollections(unittest.TestCase): def test_traversal_links(self): """ websearch - traverse all the publications of a collection """ browser = Browser() try: for aas in (0, 1): args = {'as': aas} browser.open(make_url('/collection/Preprints', **args)) for jrec in (11, 21, 11, 27): args = {'jrec': jrec, 'cc': 'Preprints'} if aas: args['as'] = aas url = make_url('/search', **args) try: browser.follow_link(url=url) except LinkNotFoundError: args['ln'] = CFG_SITE_LANG url = make_url('/search', **args) browser.follow_link(url=url) except LinkNotFoundError: self.fail('no link %r in %r' % (url, browser.geturl())) def test_collections_links(self): """ websearch - enter in collections and subcollections """ browser = Browser() def tryfollow(url): cur = browser.geturl() body = browser.response().read() try: browser.follow_link(url=url) except LinkNotFoundError: print body self.fail("in %r: could not find %r" % ( cur, url)) return for aas in (0, 1): if aas: kargs = {'as': 1} else: kargs = {} kargs['ln'] = CFG_SITE_LANG # We navigate from immediate son to immediate son... 
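# (Home -> Articles & Preprints -> Articles), then go back twice and jump
# straight to a grandson collection (ALEPH); tryfollow() fails the test if
# any of these links is missing.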
browser.open(make_url('/', **kargs)) tryfollow(make_url('/collection/Articles%20%26%20Preprints', **kargs)) tryfollow(make_url('/collection/Articles', **kargs)) # But we can also jump to a grandson immediately browser.back() browser.back() tryfollow(make_url('/collection/ALEPH', **kargs)) return def test_records_links(self): """ websearch - check the links toward records in leaf collections """ browser = Browser() browser.open(make_url('/collection/Preprints')) def harvest(): """ Parse all the links in the page, and check that for each link to a detailed record, we also have the corresponding link to the similar records.""" records = set() similar = set() for link in browser.links(): path, q = parse_url(link.url) if not path: continue if path[0] == CFG_SITE_RECORD: records.add(int(path[1])) continue if path[0] == 'search': if not q.get('rm') == ['wrd']: continue recid = q['p'][0].split(':')[1] similar.add(int(recid)) self.failUnlessEqual(records, similar) return records # We must have 10 links to the corresponding /CFG_SITE_RECORD found = harvest() self.failUnlessEqual(len(found), 10) # When clicking on the "Search" button, we must also have # these 10 links on the records. browser.select_form(name="search") browser.submit() found = harvest() self.failUnlessEqual(len(found), 10) return def test_em_parameter(self): """ websearch - check different values of em return different parts of the collection page""" for combi in string_combinations(["L", "P", "Prt"]): url = '/collection/Articles?em=%s' % ','.join(combi) expected_text = ["Development of photon beam diagnostics for VUV radiation from a SASE FEL"] unexpected_text = [] if "H" in combi: expected_text.append(">Atlantis Institute of Fictive Science") else: unexpected_text.append(">Atlantis Institute of Fictive Science") if "F" in combi: expected_text.append("This site is also available in the following languages:") else: unexpected_text.append("This site is also available in the following languages:") if "S" in combi: expected_text.append('value="Search"') else: unexpected_text.append('value="Search"') if "L" in combi: expected_text.append('Search also:') else: unexpected_text.append('Search also:') if "Prt" in combi or "P" in combi: expected_text.append('
ABOUT ARTICLES
') else: unexpected_text.append('
ABOUT ARTICLES
') self.assertEqual([], test_web_page_content(make_url(url), expected_text=expected_text, unexpected_text=unexpected_text)) return class WebSearchTestBrowse(unittest.TestCase): def test_browse_field(self): """ websearch - check that browsing works """ browser = Browser() browser.open(make_url('/')) browser.select_form(name='search') browser['f'] = ['title'] browser.submit(name='action_browse') def collect(): # We'll get a few links to search for the actual hits, plus a # link to the following results. res = [] for link in browser.links(url_regex=re.compile(CFG_SITE_URL + r'/search\?')): if link.text == 'Advanced Search': continue dummy, q = parse_url(link.url) res.append((link, q)) return res # if we follow the last link, we should get another # batch. There is an overlap of one item. batch_1 = collect() browser.follow_link(link=batch_1[-1][0]) batch_2 = collect() # FIXME: we cannot compare the whole query, as the collection # set is not equal self.failUnlessEqual(batch_1[-2][1]['p'], batch_2[0][1]['p']) def test_browse_restricted_record_as_unauthorized_user(self): """websearch - browse for a record that belongs to a restricted collection as an unauthorized user.""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?p=CERN-THESIS-99-074&f=088__a&action_browse=Browse&ln=en', username = 'guest', expected_text = ['Hits', '088__a'], unexpected_text = ['>CERN-THESIS-99-074']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_browse_restricted_record_as_unauthorized_user_in_restricted_collection(self): """websearch - browse for a record that belongs to a restricted collection as an unauthorized user.""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?p=CERN-THESIS-99-074&f=088__a&action_browse=Browse&c=ALEPH+Theses&ln=en', username='guest', expected_text= ['This collection is restricted'], unexpected_text= ['Hits', '>CERN-THESIS-99-074']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_browse_restricted_record_as_authorized_user(self): """websearch - browse for a record that belongs to a restricted collection as an authorized user.""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?p=CERN-THESIS-99-074&f=088__a&action_browse=Browse&ln=en', username='admin', password='', expected_text= ['Hits', '088__a'], unexpected_text = ['>CERN-THESIS-99-074']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_browse_restricted_record_as_authorized_user_in_restricted_collection(self): """websearch - browse for a record that belongs to a restricted collection as an authorized user.""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?p=CERN-THESIS-99-074&f=088__a&action_browse=Browse&c=ALEPH+Theses&ln=en', username='admin', password='', expected_text= ['Hits', '>CERN-THESIS-99-074']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_browse_exact_author_help_link(self): error_messages = test_web_page_content(CFG_SITE_URL + '/search?ln=en&p=Dasse%2C+Michel&f=author&action_browse=Browse', username = 'guest', expected_text = ['Did you mean to browse in', 'index?']) if error_messages: self.fail(merge_error_messages(error_messages)) error_messages = test_web_page_content(CFG_SITE_URL + '/search?ln=en&p=Dasse%2C+Michel&f=firstauthor&action_browse=Browse', username = 'guest', expected_text = ['Did you mean to browse in', 'index?']) if error_messages: self.fail(merge_error_messages(error_messages)) error_messages = test_web_page_content(CFG_SITE_URL + 
'/search?ln=en&as=1&m1=a&p1=Dasse%2C+Michel&f1=author&op1=a&m2=a&p2=&f2=firstauthor&op2=a&m3=a&p3=&f3=&action_browse=Browse', username = 'guest', expected_text = ['Did you mean to browse in', 'index?']) if error_messages: self.fail(merge_error_messages(error_messages)) class WebSearchTestOpenURL(unittest.TestCase): def test_isbn_01(self): """ websearch - isbn query via OpenURL 0.1""" browser = Browser() # We do a precise search in an isolated collection browser.open(make_url('/openurl', isbn='0387940758')) dummy, current_q = parse_url(browser.geturl()) self.failUnlessEqual(current_q, { 'sc' : ['1'], 'p' : ['isbn:"0387940758"'], 'of' : ['hd'] }) def test_isbn_10_rft_id(self): """ websearch - isbn query via OpenURL 1.0 - rft_id""" browser = Browser() # We do a precise search in an isolated collection browser.open(make_url('/openurl', rft_id='urn:ISBN:0387940758')) dummy, current_q = parse_url(browser.geturl()) self.failUnlessEqual(current_q, { 'sc' : ['1'], 'p' : ['isbn:"0387940758"'], 'of' : ['hd'] }) def test_isbn_10(self): """ websearch - isbn query via OpenURL 1.0""" browser = Browser() # We do a precise search in an isolated collection browser.open(make_url('/openurl?rft.isbn=0387940758')) dummy, current_q = parse_url(browser.geturl()) self.failUnlessEqual(current_q, { 'sc' : ['1'], 'p' : ['isbn:"0387940758"'], 'of' : ['hd'] }) class WebSearchTestSearch(unittest.TestCase): def test_hits_in_other_collection(self): """ websearch - check extension of a query to the home collection """ browser = Browser() # We do a precise search in an isolated collection browser.open(make_url('/collection/ISOLDE', ln='en')) browser.select_form(name='search') browser['f'] = ['author'] browser['p'] = 'matsubara' browser.submit() dummy, current_q = parse_url(browser.geturl()) link = browser.find_link(text_regex=re.compile('.*hit', re.I)) dummy, target_q = parse_url(link.url) # the target query should be the current query without any c # or cc specified. for f in ('cc', 'c', 'action_search'): if f in current_q: del current_q[f] self.failUnlessEqual(current_q, target_q) def test_nearest_terms(self): """ websearch - provide a list of nearest terms """ browser = Browser() browser.open(make_url('')) # Search something weird browser.select_form(name='search') browser['p'] = 'gronf' browser.submit() dummy, original = parse_url(browser.geturl()) for to_drop in ('cc', 'action_search', 'f'): if to_drop in original: del original[to_drop] if 'ln' not in original: original['ln'] = [CFG_SITE_LANG] # we should get a few searches back, which are identical # except for the p field being substituted (and the cc field # being dropped). 
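# In other words: every suggested nearest-term link must point to the same
# query as the original one, with only p replaced by the link text.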
if 'cc' in original: del original['cc'] for link in browser.links(url_regex=re.compile(CFG_SITE_URL + r'/search\?')): if link.text == 'Advanced Search': continue dummy, target = parse_url(link.url) if 'ln' not in target: target['ln'] = [CFG_SITE_LANG] original['p'] = [link.text] self.failUnlessEqual(original, target) return def test_switch_to_simple_search(self): """ websearch - switch to simple search """ browser = Browser() args = {'as': 1} browser.open(make_url('/collection/ISOLDE', **args)) browser.select_form(name='search') browser['p1'] = 'tandem' browser['f1'] = ['title'] browser.submit() browser.follow_link(text='Simple Search') dummy, q = parse_url(browser.geturl()) self.failUnlessEqual(q, {'cc': ['ISOLDE'], 'p': ['tandem'], 'f': ['title'], 'ln': ['en']}) def test_switch_to_advanced_search(self): """ websearch - switch to advanced search """ browser = Browser() browser.open(make_url('/collection/ISOLDE')) browser.select_form(name='search') browser['p'] = 'tandem' browser['f'] = ['title'] browser.submit() browser.follow_link(text='Advanced Search') dummy, q = parse_url(browser.geturl()) self.failUnlessEqual(q, {'cc': ['ISOLDE'], 'p1': ['tandem'], 'f1': ['title'], 'as': ['1'], 'ln' : ['en']}) def test_no_boolean_hits(self): """ websearch - check the 'no boolean hits' proposed links """ browser = Browser() browser.open(make_url('')) browser.select_form(name='search') browser['p'] = 'quasinormal muon' browser.submit() dummy, q = parse_url(browser.geturl()) for to_drop in ('cc', 'action_search', 'f'): if to_drop in q: del q[to_drop] for bsu in ('quasinormal', 'muon'): l = browser.find_link(text=bsu) q['p'] = bsu if not same_urls_p(l.url, make_url('/search', **q)): self.fail(repr((l.url, make_url('/search', **q)))) def test_similar_authors(self): """ websearch - test similar authors box """ browser = Browser() browser.open(make_url('')) browser.select_form(name='search') browser['p'] = 'Ellis, R K' browser['f'] = ['author'] browser.submit() l = browser.find_link(text="Ellis, R S") self.failUnless(same_urls_p(l.url, make_url('/search', p="Ellis, R S", f='author', ln='en'))) def test_em_parameter(self): """ websearch - check different values of em return different parts of the search page""" for combi in string_combinations(["K", "A", "I", "O"]): url = '/search?ln=en&cc=Articles+%%26+Preprints&sc=1&c=Articles&c=Preprints&em=%s' % ','.join(combi) expected_text = ["Development of photon beam diagnostics for VUV radiation from a SASE FEL"] unexpected_text = [] if "H" in combi: expected_text.append(">Atlantis Institute of Fictive Science") else: unexpected_text.append(">Atlantis Institute of Fictive Science") if "F" in combi: expected_text.append("This site is also available in the following languages:") else: unexpected_text.append("This site is also available in the following languages:") if "S" in combi: expected_text.append('value="Search"') else: unexpected_text.append('value="Search"') if "K" in combi: expected_text.append('value="Add to basket"') else: unexpected_text.append('value="Add to basket"') if "A" in combi: expected_text.append('Interested in being notified about new results for this query?') else: unexpected_text.append('Interested in being notified about new results for this query?') if "I" in combi: expected_text.append('jump to record:') else: unexpected_text.append('jump to record:') if "O" in combi: expected_text.append('Results overview: Found ') else: unexpected_text.append('Results overview: Found ') self.assertEqual([], test_web_page_content(make_url(url), 
expected_text=expected_text, unexpected_text=unexpected_text)) return class WebSearchTestWildcardLimit(unittest.TestCase): """Checks if the wildcard limit is correctly passed and that users without autorization can not exploit it""" def test_wildcard_limit_correctly_passed_when_not_set(self): """websearch - wildcard limit is correctly passed when default""" self.assertEqual(search_pattern(p='e*', f='author'), search_pattern(p='e*', f='author', wl=1000)) def test_wildcard_limit_correctly_passed_when_set(self): """websearch - wildcard limit is correctly passed when set""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=e*&f=author&of=id&wl=5&rg=100', expected_text="[9, 10, 11, 17, 46, 48, 50, 51, 52, 53, 54, 67, 72, 74, 81, 88, 92, 96]")) def test_wildcard_limit_correctly_not_active(self): """websearch - wildcard limit is not active when there is no wildcard query""" self.assertEqual(search_pattern(p='ellis', f='author'), search_pattern(p='ellis', f='author', wl=1)) def test_wildcard_limit_increased_by_authorized_users(self): """websearch - wildcard limit increased by authorized user""" browser = Browser() #try a search query, with no wildcard limit set by the user browser.open(make_url('/search?p=a*&of=id')) recid_list_guest_no_limit = browser.response().read() # so the limit is CGF_WEBSEARCH_WILDCARD_LIMIT #try a search query, with a wildcard limit imposed by the user #wl=1000000 - a very high limit,higher then what the CFG_WEBSEARCH_WILDCARD_LIMIT might be browser.open(make_url('/search?p=a*&of=id&wl=1000000')) recid_list_guest_with_limit = browser.response().read() #same results should be returned for a search without the wildcard limit set by the user #and for a search with a large limit set by the user #in this way we know that nomatter how large the limit is, the wildcard query will be #limitted by CFG_WEBSEARCH_WILDCARD_LIMIT (for a guest user) self.failIf(len(recid_list_guest_no_limit.split(',')) != len(recid_list_guest_with_limit.split(','))) ##login as admin browser.open(make_surl('/youraccount/login')) browser.select_form(nr=0) browser['p_un'] = 'admin' browser['p_pw'] = '' browser.submit() #try a search query, with a wildcard limit imposed by an authorized user #wl = 10000 a very high limit, higher then what the CFG_WEBSEARCH_WILDCARD_LIMIT might be browser.open(make_surl('/search?p=a*&of=id&wl=10000')) recid_list_authuser_with_limit = browser.response().read() #the authorized user can set whatever limit he might wish #so, the results returned for the auth. users should exceed the results returned for unauth. 
users self.failUnless(len(recid_list_guest_no_limit.split(',')) <= len(recid_list_authuser_with_limit.split(','))) #logout browser.open(make_surl('/youraccount/logout')) browser.response().read() browser.close() class WebSearchNearestTermsTest(unittest.TestCase): """Check various alternatives of searches leading to the nearest terms box.""" def test_nearest_terms_box_in_okay_query(self): """ websearch - no nearest terms box for a successful query """ self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=ellis', expected_text="jump to record")) def test_nearest_terms_box_in_unsuccessful_simple_query(self): """ websearch - nearest terms box for unsuccessful simple query """ self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=ellisz', expected_text="Nearest terms in any collection are", expected_link_target=CFG_SITE_URL+"/search?ln=en&p=embed", expected_link_label='embed')) def test_nearest_terms_box_in_unsuccessful_simple_accented_query(self): """ websearch - nearest terms box for unsuccessful accented query """ self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=elliszà', expected_text="Nearest terms in any collection are", expected_link_target=CFG_SITE_URL+"/search?ln=en&p=embed", expected_link_label='embed')) def test_nearest_terms_box_in_unsuccessful_structured_query(self): """ websearch - nearest terms box for unsuccessful structured query """ self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=ellisz&f=author', expected_text="Nearest terms in any collection are", expected_link_target=CFG_SITE_URL+"/search?ln=en&p=fabbro&f=author", expected_link_label='fabbro')) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=author%3Aellisz', expected_text="Nearest terms in any collection are", expected_link_target=CFG_SITE_URL+"/search?ln=en&p=author%3Afabbro", expected_link_label='fabbro')) def test_nearest_terms_box_in_query_with_invalid_index(self): """ websearch - nearest terms box for queries with invalid indexes specified """ self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=bednarz%3Aellis', expected_text="Nearest terms in any collection are", expected_link_target=CFG_SITE_URL+"/search?ln=en&p=bednarz", expected_link_label='bednarz')) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=1%3Aellis', expected_text="no index 1.", expected_link_target=CFG_SITE_URL+"/record/47?ln=en", expected_link_label="Detailed record")) def test_nearest_terms_box_in_unsuccessful_phrase_query(self): """ websearch - nearest terms box for unsuccessful phrase query """ self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=author%3A%22Ellis%2C+Z%22', expected_text="Nearest terms in any collection are", expected_link_target=CFG_SITE_URL+"/search?ln=en&p=author%3A%22Enqvist%2C+K%22", expected_link_label='Enqvist, K')) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=%22ellisz%22&f=author', expected_text="Nearest terms in any collection are", expected_link_target=CFG_SITE_URL+"/search?ln=en&p=%22Enqvist%2C+K%22&f=author", expected_link_label='Enqvist, K')) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=%22elliszà%22&f=author', expected_text="Nearest terms in any collection are", expected_link_target=CFG_SITE_URL+"/search?ln=en&p=%22Enqvist%2C+K%22&f=author", expected_link_label='Enqvist, K')) def test_nearest_terms_box_in_unsuccessful_partial_phrase_query(self): """ websearch - nearest terms box for unsuccessful partial phrase query """ 
self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=author%3A%27Ellis%2C+Z%27', expected_text="Nearest terms in any collection are", expected_link_target=CFG_SITE_URL+"/search?ln=en&p=author%3A%27Enqvist%2C+K%27", expected_link_label='Enqvist, K')) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=%27ellisz%27&f=author', expected_text="Nearest terms in any collection are", expected_link_target=CFG_SITE_URL+"/search?ln=en&p=%27Enqvist%2C+K%27&f=author", expected_link_label='Enqvist, K')) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=%27elliszà%27&f=author', expected_text="Nearest terms in any collection are", expected_link_target=CFG_SITE_URL+"/search?ln=en&p=%27Enqvist%2C+K%27&f=author", expected_link_label='Enqvist, K')) def test_nearest_terms_box_in_unsuccessful_partial_phrase_advanced_query(self): """ websearch - nearest terms box for unsuccessful partial phrase advanced search query """ self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p1=aaa&f1=title&m1=p&as=1', expected_text="Nearest terms in any collection are", expected_link_target=CFG_SITE_URL+"/search?ln=en&f1=title&as=1&p1=A+simple+functional+form+for+proton-nucleus+total+reaction+cross+sections&m1=p", expected_link_label='A simple functional form for proton-nucleus total reaction cross sections')) def test_nearest_terms_box_in_unsuccessful_exact_phrase_advanced_query(self): """ websearch - nearest terms box for unsuccessful exact phrase advanced search query """ self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p1=aaa&f1=title&m1=e&as=1', expected_text="Nearest terms in any collection are", expected_link_target=CFG_SITE_URL+"/search?ln=en&f1=title&as=1&p1=A+simple+functional+form+for+proton-nucleus+total+reaction+cross+sections&m1=e", expected_link_label='A simple functional form for proton-nucleus total reaction cross sections')) def test_nearest_terms_box_in_unsuccessful_boolean_query(self): """ websearch - nearest terms box for unsuccessful boolean query """ self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=title%3Aellisz+author%3Aellisz', expected_text="Nearest terms in any collection are", expected_link_target=CFG_SITE_URL+"/search?ln=en&p=title%3Aenergi+author%3Aellisz", expected_link_label='energi')) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=title%3Aenergi+author%3Aenergie', expected_text="Nearest terms in any collection are", expected_link_target=CFG_SITE_URL+"/search?ln=en&p=title%3Aenergi+author%3Aenqvist", expected_link_label='enqvist')) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?ln=en&p=title%3Aellisz+author%3Aellisz&f=keyword', expected_text="Nearest terms in any collection are", expected_link_target=CFG_SITE_URL+"/search?ln=en&p=title%3Aenergi+author%3Aellisz&f=keyword", expected_link_label='energi')) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?ln=en&p=title%3Aenergi+author%3Aenergie&f=keyword', expected_text="Nearest terms in any collection are", expected_link_target=CFG_SITE_URL+"/search?ln=en&p=title%3Aenergi+author%3Aenqvist&f=keyword", expected_link_label='enqvist')) def test_nearest_terms_box_in_unsuccessful_uppercase_query(self): """ websearch - nearest terms box for unsuccessful uppercase query """ self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=fOo%3Atest', expected_text="Nearest terms in any collection are", expected_link_target=CFG_SITE_URL+"/search?ln=en&p=food", expected_link_label='food')) 
self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=arXiv%3A1007.5048', expected_text="Nearest terms in any collection are", expected_link_target=CFG_SITE_URL+"/search?ln=en&p=artist", expected_link_label='artist')) def test_nearest_terms_box_in_unsuccessful_spires_query(self): """ websearch - nearest terms box for unsuccessful spires query """ self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?ln=en&p=find+a+foobar', expected_text="Nearest terms in any collection are", expected_link_target=CFG_SITE_URL+"/search?ln=en&p=find+a+finch", expected_link_label='finch')) class WebSearchBooleanQueryTest(unittest.TestCase): """Check various boolean queries.""" def test_successful_boolean_query(self): """ websearch - successful boolean query """ self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=ellis+muon', expected_text="records found", expected_link_label="Detailed record")) def test_unsuccessful_boolean_query_where_all_individual_terms_match(self): """ websearch - unsuccessful boolean query where all individual terms match """ self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=ellis+muon+letter', expected_text="Boolean query returned no hits. Please combine your search terms differently.")) def test_unsuccessful_boolean_query_in_advanced_search_where_all_individual_terms_match(self): """ websearch - unsuccessful boolean query in advanced search where all individual terms match """ self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?m1=a&p1=ellis&op1=a&m2=a&p2=muon&op2=a&p3=letter', expected_text="Boolean query returned no hits. Please combine your search terms differently.")) class WebSearchAuthorQueryTest(unittest.TestCase): """Check various author-related queries.""" def test_propose_similar_author_names_box(self): """ websearch - propose similar author names box """ self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=Ellis%2C+R&f=author', expected_text="See also: similar author names", expected_link_target=CFG_SITE_URL+"/search?ln=en&p=Ellis%2C+R+K&f=author", expected_link_label="Ellis, R K")) def test_do_not_propose_similar_author_names_box(self): """ websearch - do not propose similar author names box """ errmsgs = test_web_page_content(CFG_SITE_URL + '/search?p=author%3A%22Ellis%2C+R%22', expected_link_target=CFG_SITE_URL+"/search?ln=en&p=Ellis%2C+R+K&f=author", expected_link_label="Ellis, R K") if errmsgs[0].find("does not contain link to") > -1: pass else: self.fail("Should not propose similar author names box.") return class WebSearchSearchEnginePythonAPITest(unittest.TestCase): """Check typical search engine Python API calls on the demo data.""" def test_search_engine_python_api_for_failed_query(self): """websearch - search engine Python API for failed query""" self.assertEqual([], perform_request_search(p='aoeuidhtns')) def test_search_engine_python_api_for_successful_query(self): """websearch - search engine Python API for successful query""" self.assertEqual([8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 47], perform_request_search(p='ellis')) def test_search_engine_web_api_ignore_paging_parameter(self): """websearch - search engine Python API for successful query, ignore paging parameters""" self.assertEqual([8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 47], perform_request_search(p='ellis', rg=5, jrec=3)) def test_search_engine_web_api_respect_sorting_parameter(self): """websearch - search engine Python API for successful query, respect sorting parameters""" self.assertEqual([77, 84, 85], 
perform_request_search(p='klebanov')) self.assertEqual([77, 85, 84], perform_request_search(p='klebanov', sf='909C4v')) def test_search_engine_web_api_respect_ranking_parameter(self): """websearch - search engine Python API for successful query, respect ranking parameters""" self.assertEqual([77, 84, 85], perform_request_search(p='klebanov')) self.assertEqual([85, 77, 84], perform_request_search(p='klebanov', rm='citation')) def test_search_engine_python_api_for_existing_record(self): """websearch - search engine Python API for existing record""" self.assertEqual([8], perform_request_search(recid=8)) def test_search_engine_python_api_for_nonexisting_record(self): """websearch - search engine Python API for non-existing record""" self.assertEqual([], perform_request_search(recid=16777215)) def test_search_engine_python_api_for_nonexisting_collection(self): """websearch - search engine Python API for non-existing collection""" self.assertEqual([], perform_request_search(c='Foo')) def test_search_engine_python_api_for_range_of_records(self): """websearch - search engine Python API for range of records""" self.assertEqual([1, 2, 3, 4, 5, 6, 7, 8, 9], perform_request_search(recid=1, recidb=10)) def test_search_engine_python_api_ranked_by_citation(self): """websearch - search engine Python API for citation ranking""" self.assertEqual([82, 83, 87, 89], perform_request_search(p='recid:81', rm='citation')) def test_search_engine_python_api_textmarc(self): """websearch - search engine Python API for Text MARC output""" # we are testing example from /help/hacking/search-engine-api tmp = cStringIO.StringIO() perform_request_search(req=tmp, p='higgs', of='tm', ot=['100', '700']) out = tmp.getvalue() tmp.close() self.assertEqual(out, """\ 000000085 100__ $$aGirardello, L$$uINFN$$uUniversita di Milano-Bicocca 000000085 700__ $$aPorrati, Massimo 000000085 700__ $$aZaffaroni, A 000000001 100__ $$aPhotolab """) def test_search_engine_python_api_for_intersect_results_with_one_collrec(self): """websearch - search engine Python API for intersect results with one collrec""" self.assertEqual({'Books & Reports': intbitset([19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34])}, intersect_results_with_collrecs(None, intbitset(range(0,110)), ['Books & Reports'], 0, 'id', 0, 'en', False)) def test_search_engine_python_api_for_intersect_results_with_several_collrecs(self): """websearch - search engine Python API for intersect results with several collrecs""" self.assertEqual({'Books': intbitset([21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]), 'Reports': intbitset([19, 20]), 'Theses': intbitset([35, 36, 37, 38, 39, 40, 41, 42, 105])}, intersect_results_with_collrecs(None, intbitset(range(0,110)), ['Books', 'Theses', 'Reports'], 0, 'id', 0, 'en', False)) class WebSearchSearchEngineWebAPITest(unittest.TestCase): """Check typical search engine Web API calls on the demo data.""" def test_search_engine_web_api_for_failed_query(self): """websearch - search engine Web API for failed query""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=aoeuidhtns&of=id', expected_text="[]")) def test_search_engine_web_api_for_successful_query(self): """websearch - search engine Web API for successful query""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=ellis&of=id', expected_text="[8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 47]")) def test_search_engine_web_api_ignore_paging_parameter(self): """websearch - search engine Web API for successful query, ignore paging 
parameters""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=ellis&of=id&rg=5&jrec=3', expected_text="[8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 47]")) def test_search_engine_web_api_respect_sorting_parameter(self): """websearch - search engine Web API for successful query, respect sorting parameters""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=klebanov&of=id', expected_text="[84, 85]")) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=klebanov&of=id', username="admin", expected_text="[77, 84, 85]")) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=klebanov&of=id&sf=909C4v', expected_text="[85, 84]")) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=klebanov&of=id&sf=909C4v', username="admin", expected_text="[77, 85, 84]")) def test_search_engine_web_api_respect_ranking_parameter(self): """websearch - search engine Web API for successful query, respect ranking parameters""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=klebanov&of=id', expected_text="[84, 85]")) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=klebanov&of=id', username="admin", expected_text="[77, 84, 85]")) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=klebanov&of=id&rm=citation', expected_text="[85, 84]")) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=klebanov&of=id&rm=citation', username="admin", expected_text="[85, 77, 84]")) def test_search_engine_web_api_for_existing_record(self): """websearch - search engine Web API for existing record""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?recid=8&of=id', expected_text="[8]")) def test_search_engine_web_api_for_nonexisting_record(self): """websearch - search engine Web API for non-existing record""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?recid=123456789&of=id', expected_text="[]")) def test_search_engine_web_api_for_nonexisting_collection(self): """websearch - search engine Web API for non-existing collection""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?c=Foo&of=id', expected_text="[]")) def test_search_engine_web_api_for_range_of_records(self): """websearch - search engine Web API for range of records""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?recid=1&recidb=10&of=id', expected_text="[1, 2, 3, 4, 5, 6, 7, 8, 9]")) class WebSearchRestrictedCollectionTest(unittest.TestCase): """Test of the restricted collections behaviour.""" def test_restricted_collection_interface_page(self): """websearch - restricted collection interface page body""" # there should be no Latest additions box for restricted collections self.assertNotEqual([], test_web_page_content(CFG_SITE_URL + '/collection/Theses', expected_text="Latest additions")) def test_restricted_search_as_anonymous_guest(self): """websearch - restricted collection not searchable by anonymous guest""" browser = Browser() browser.open(CFG_SITE_URL + '/search?c=Theses') response = browser.response().read() if response.find("If you think you have right to access it, please authenticate yourself.") > -1: pass else: self.fail("Oops, searching restricted collection without password should have redirected to login dialog.") return def test_restricted_search_as_authorized_person(self): """websearch - restricted collection searchable by authorized person""" browser = Browser() browser.open(CFG_SITE_URL + '/search?c=Theses') 
browser.select_form(nr=0) browser['p_un'] = 'jekyll' browser['p_pw'] = 'j123ekyll' browser.submit() if browser.response().read().find("records found") > -1: pass else: self.fail("Oops, Dr. Jekyll should be able to search Theses collection.") def test_restricted_search_as_unauthorized_person(self): """websearch - restricted collection not searchable by unauthorized person""" browser = Browser() browser.open(CFG_SITE_URL + '/search?c=Theses') browser.select_form(nr=0) browser['p_un'] = 'hyde' browser['p_pw'] = 'h123yde' browser.submit() # Mr. Hyde should not be able to connect: if browser.response().read().find("Authorization failure") <= -1: # if we got here, things are broken: self.fail("Oops, Mr.Hyde should not be able to search Theses collection.") def test_restricted_detailed_record_page_as_anonymous_guest(self): """websearch - restricted detailed record page not accessible to guests""" browser = Browser() browser.open(CFG_SITE_URL + '/%s/35' % CFG_SITE_RECORD) if browser.response().read().find("You can use your nickname or your email address to login.") > -1: pass else: self.fail("Oops, searching restricted collection without password should have redirected to login dialog.") return def test_restricted_detailed_record_page_as_authorized_person(self): """websearch - restricted detailed record page accessible to authorized person""" browser = Browser() browser.open(CFG_SITE_URL + '/youraccount/login') browser.select_form(nr=0) browser['p_un'] = 'jekyll' browser['p_pw'] = 'j123ekyll' browser.submit() browser.open(CFG_SITE_URL + '/%s/35' % CFG_SITE_RECORD) # Dr. Jekyll should be able to connect # (add the pw to the whole CFG_SITE_URL because we shall be # redirected to '/reordrestricted/'): if browser.response().read().find("A High-performance Video Browsing System") > -1: pass else: self.fail("Oops, Dr. Jekyll should be able to access restricted detailed record page.") def test_restricted_detailed_record_page_as_unauthorized_person(self): """websearch - restricted detailed record page not accessible to unauthorized person""" browser = Browser() browser.open(CFG_SITE_URL + '/youraccount/login') browser.select_form(nr=0) browser['p_un'] = 'hyde' browser['p_pw'] = 'h123yde' browser.submit() browser.open(CFG_SITE_URL + '/%s/35' % CFG_SITE_RECORD) # Mr. 
Hyde should not be able to connect: if browser.response().read().find('You are not authorized') <= -1: # if we got here, things are broken: self.fail("Oops, Mr.Hyde should not be able to access restricted detailed record page.") def test_collection_restricted_p(self): """websearch - collection_restricted_p""" self.failUnless(collection_restricted_p('Theses'), True) self.failIf(collection_restricted_p('Books & Reports')) def test_get_permitted_restricted_collections(self): """websearch - get_permitted_restricted_collections""" from invenio.webuser import get_uid_from_email, collect_user_info self.assertEqual(get_permitted_restricted_collections(collect_user_info(get_uid_from_email('jekyll@cds.cern.ch'))), ['Theses', 'Drafts']) self.assertEqual(get_permitted_restricted_collections(collect_user_info(get_uid_from_email('hyde@cds.cern.ch'))), []) self.assertEqual(get_permitted_restricted_collections(collect_user_info(get_uid_from_email('balthasar.montague@cds.cern.ch'))), ['ALEPH Theses', 'ALEPH Internal Notes', 'Atlantis Times Drafts']) self.assertEqual(get_permitted_restricted_collections(collect_user_info(get_uid_from_email('dorian.gray@cds.cern.ch'))), ['ISOLDE Internal Notes']) def test_restricted_record_has_restriction_flag(self): """websearch - restricted record displays a restriction flag""" browser = Browser() browser.open(CFG_SITE_URL + '/%s/42/files/' % CFG_SITE_RECORD) browser.select_form(nr=0) browser['p_un'] = 'jekyll' browser['p_pw'] = 'j123ekyll' browser.submit() if browser.response().read().find("Restricted") > -1: pass else: self.fail("Oops, a 'Restricted' flag should appear on restricted records.") browser.open(CFG_SITE_URL + '/%s/42/files/comments' % CFG_SITE_RECORD) if browser.response().read().find("Restricted") > -1: pass else: self.fail("Oops, a 'Restricted' flag should appear on restricted records.") # Flag also appear on records that exist both in a public and # restricted collection: error_messages = test_web_page_content(CFG_SITE_URL + '/%s/109' % CFG_SITE_RECORD, username='admin', password='', expected_text=['Restricted']) if error_messages: self.fail("Oops, a 'Restricted' flag should appear on restricted records.") class WebSearchRestrictedCollectionHandlingTest(unittest.TestCase): """ Check how the restricted or restricted and "hidden" collection handling works: (i)user has or not rights to access to specific records or collections, (ii)public and restricted results are displayed in the right position in the collection tree, (iii)display the right warning depending on the case. 
Changes in the collection tree used for testing (the records used for testing are shown as well): Articles & Preprints Books & Reports _____________|________________ ____________|_____________ | | | | | | | Articles Drafts(r) Notes Preprints Books Theses(r) Reports 69 77 109 10 105 77 98 98 108 105 CERN Experiments _________________________|___________________________ | | ALEPH ISOLDE _________________|_________________ ____________|_____________ | | | | | ALEPH ALEPH ALEPH ISOLDE ISOLDE Papers Internal Notes(r) Theses(r) Papers Internal Notes(r&h) 10 109 105 69 110 108 106 Authorized users: jekyll -> Drafts, Theses balthasar -> ALEPH Internal Notes, ALEPH Theses dorian -> ISOLDE Internal Notes """ def test_show_public_colls_in_warning_as_unauthorized_user(self): """websearch - show public daughter collections in warning to unauthorized user""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?ln=en&cc=Articles+%26+Preprints&sc=1&p=recid:20', username='hyde', password='h123yde', expected_text=['No match found in collection Articles, Preprints, Notes.']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_show_public_and_restricted_colls_in_warning_as_authorized_user(self): """websearch - show public and restricted daughter collections in warning to authorized user""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?ln=en&cc=Articles+%26+Preprints&sc=1&p=recid:20', username='jekyll', password='j123ekyll', expected_text=['No match found in collection Articles, Preprints, Notes, Drafts.']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_restricted_record_in_different_colls_as_unauthorized_user(self): """websearch - record belongs to different restricted collections with different rights, user does not have rights""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?p=105&f=recid', username='hyde', password='h123yde', expected_text=['No public collection matched your query.'], unexpected_text=['records found']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_restricted_record_in_different_colls_as_authorized_user_of_one_coll(self): """websearch - record belongs to different restricted collections with different rights, balthasar has rights to one of them""" from invenio.config import CFG_WEBSEARCH_VIEWRESTRCOLL_POLICY policy = CFG_WEBSEARCH_VIEWRESTRCOLL_POLICY.strip().upper() if policy == 'ANY': error_messages = test_web_page_content(CFG_SITE_URL + '/search?&sc=1&p=recid:105&c=Articles+%26+Preprints&c=Books+%26+Reports&c=Multimedia+%26+Arts', username='balthasar', password='b123althasar', expected_text=['[CERN-THESIS-99-074]'], unexpected_text=['No public collection matched your query.']) else: error_messages = test_web_page_content(CFG_SITE_URL + '/search?&sc=1&p=recid:105&c=Articles+%26+Preprints&c=Books+%26+Reports&c=Multimedia+%26+Arts', username='balthasar', password='b123althasar', expected_text=['No public collection matched your query.'], unexpected_text=['[CERN-THESIS-99-074]']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_restricted_record_in_different_colls_as_authorized_user_of_two_colls(self): """websearch - record belongs to different restricted collections with different rights, jekyll has rights to two of them""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?&sc=1&p=recid:105&c=Articles+%26+Preprints&c=Books+%26+Reports&c=Multimedia+%26+Arts', username='jekyll', password='j123ekyll', expected_text=['Articles &
Preprints', 'Books & Reports']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_restricted_record_in_different_colls_as_authorized_user_of_all_colls(self): """websearch - record belongs to different restricted collections with different rights, admin has rights to all of them""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?&sc=1&p=recid:105&c=Articles+%26+Preprints&c=Books+%26+Reports&c=Multimedia+%26+Arts', username='admin', expected_text=['Articles & Preprints', 'Books & Reports', 'ALEPH Theses']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_search_restricted_record_from_not_dad_coll(self): """websearch - record belongs to different restricted collections with different rights, search from a not dad collection""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?ln=en&cc=Multimedia+%26+Arts&sc=1&p=recid%3A105&f=&action_search=Search&c=Pictures&c=Poetry&c=Atlantis+Times', username='admin', expected_text='No match found in collection', expected_link_label='1 hits') if error_messages: self.fail(merge_error_messages(error_messages)) def test_public_and_restricted_record_as_unauthorized_user(self): """websearch - record belongs to different public and restricted collections, user not has rights""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?&sc=1&p=geometry&c=Articles+%26+Preprints&c=Books+%26+Reports&c=Multimedia+%26+Arts&of=id', username='guest', expected_text='[80, 86]', unexpected_text='[40, 80, 86]') if error_messages: self.fail(merge_error_messages(error_messages)) def test_public_and_restricted_record_as_authorized_user(self): """websearch - record belongs to different public and restricted collections, admin has rights""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?&sc=1&p=geometry&c=Articles+%26+Preprints&c=Books+%26+Reports&c=Multimedia+%26+Arts&of=id', username='admin', password='', expected_text='[40, 80, 86]') if error_messages: self.fail(merge_error_messages(error_messages)) def test_public_and_restricted_record_of_focus_as_unauthorized_user(self): """websearch - record belongs to both a public and a restricted collection of "focus on", user not has rights""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?ln=en&cc=Articles+%26+Preprints&sc=1&p=109&f=recid', username='hyde', password='h123yde', expected_text=['No public collection matched your query'], unexpected_text=['LEP Center-of-Mass Energies in Presence of Opposite']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_public_and_restricted_record_of_focus_as_authorized_user(self): """websearch - record belongs to both a public and a restricted collection of "focus on", user has rights""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?&sc=1&p=109&f=recid&c=Articles+%26+Preprints&c=Books+%26+Reports&c=Multimedia+%26+Arts', username='balthasar', password='b123althasar', expected_text=['Articles & Preprints', 'ALEPH Internal Notes', 'LEP Center-of-Mass Energies in Presence of Opposite']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_search_public_and_restricted_record_from_not_dad_coll_as_authorized_user(self): """websearch - record belongs to both a public and a restricted collection, search from a not dad collection, admin has rights""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?ln=en&cc=Books+%26+Reports&sc=1&p=recid%3A98&f=&action_search=Search&c=Books&c=Reports', username='admin', 
password='', expected_text='No match found in collection Books, Theses, Reports', expected_link_label='1 hits') if error_messages: self.fail(merge_error_messages(error_messages)) def test_search_public_and_restricted_record_from_not_dad_coll_as_unauthorized_user(self): """websearch - record belongs to both a public and a restricted collection, search from a not dad collection, hyde not has rights""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?ln=en&cc=Books+%26+Reports&sc=1&p=recid%3A98&f=&action_search=Search&c=Books&c=Reports', username='hyde', password='h123yde', expected_text='No public collection matched your query', unexpected_text='No match found in collection') if error_messages: self.fail(merge_error_messages(error_messages)) def test_restricted_record_of_focus_as_authorized_user(self): """websearch - record belongs to a restricted collection of "focus on", balthasar has rights""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?&sc=1&p=106&f=recid&c=Articles+%26+Preprints&c=Books+%26+Reports&c=Multimedia+%26+Arts&of=id', username='balthasar', password='b123althasar', expected_text='[106]', unexpected_text='[]') if error_messages: self.fail(merge_error_messages(error_messages)) def test_display_dad_coll_of_restricted_coll_as_unauthorized_user(self): """websearch - unauthorized user displays a collection that contains a restricted collection""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?ln=en&cc=Articles+%26+Preprints&sc=1&p=&f=&action_search=Search&c=Articles&c=Drafts&c=Preprints', username='guest', expected_text=['This collection is restricted.']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_display_dad_coll_of_restricted_coll_as_authorized_user(self): """websearch - authorized user displays a collection that contains a restricted collection""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?ln=en&cc=Articles+%26+Preprints&sc=1&p=&f=&action_search=Search&c=Articles&c=Drafts&c=Notes&c=Preprints', username='jekyll', password='j123ekyll', expected_text=['Articles', 'Drafts', 'Notes', 'Preprints'], unexpected_text=['This collection is restricted.']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_search_restricted_record_from_coll_of_focus_as_unauthorized_user(self): """websearch - search for a record that belongs to a restricted collection from a collection of "focus on" , jekyll not has rights""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?ln=en&cc=CERN+Divisions&sc=1&p=recid%3A106&f=&action_search=Search&c=Experimental+Physics+(EP)&c=Theoretical+Physics+(TH)', username='jekyll', password='j123ekyll', expected_text=['No public collection matched your query.']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_search_restricted_record_from_coll_of_focus_as_authorized_user(self): """websearch - search for a record that belongs to a restricted collection from a collection of "focus on" , admin has rights""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?ln=en&cc=CERN+Divisions&sc=1&p=recid%3A106&f=&action_search=Search&c=Experimental+Physics+(EP)&c=Theoretical+Physics+(TH)', username='admin', password='', expected_text='No match found in collection Experimental Physics (EP), Theoretical Physics (TH).', expected_link_label='1 hits') if error_messages: self.fail(merge_error_messages(error_messages)) def test_search_restricted_record_from_not_direct_dad_coll_and_display_in_right_position_in_tree(self): 
"""websearch - search for a restricted record from not direct dad collection and display it on its right position in the tree""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?ln=en&sc=1&p=recid%3A40&f=&action_search=Search&c=Articles+%26+Preprints&c=Books+%26+Reports&c=Multimedia+%26+Arts', username='admin', password='', expected_text=['Books & Reports','[LBL-22304]']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_search_restricted_record_from_direct_dad_coll_and_display_in_right_position_in_tree(self): """websearch - search for a restricted record from the direct dad collection and display it on its right position in the tree""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?ln=en&cc=Books+%26+Reports&sc=1&p=recid%3A40&f=&action_search=Search&c=Books&c=Reports', username='admin', password='', expected_text=['Theses', '[LBL-22304]']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_restricted_and_hidden_record_as_unauthorized_user(self): """websearch - search for a "hidden" record, user not has rights""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?ln=en&sc=1&p=recid%3A110&f=&action_search=Search&c=Articles+%26+Preprints&c=Books+%26+Reports&c=Multimedia+%26+Arts', username='guest', expected_text=['If you were looking for a non-public document'], unexpected_text=['If you were looking for a hidden document']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_restricted_and_hidden_record_as_authorized_user(self): """websearch - search for a "hidden" record, admin has rights""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?ln=en&sc=1&p=recid%3A110&f=&action_search=Search&c=Articles+%26+Preprints&c=Books+%26+Reports&c=Multimedia+%26+Arts', username='admin', password='', expected_text=['If you were looking for a hidden document, please type the correct URL for this record.']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_enter_url_of_restricted_and_hidden_coll_as_unauthorized_user(self): """websearch - unauthorized user types the concret URL of a "hidden" collection""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?ln=en&cc=ISOLDE+Internal+Notes&sc=1&p=&f=&action_search=Search', username='guest', expected_text=['This collection is restricted.']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_enter_url_of_restricted_and_hidden_coll_as_authorized_user(self): """websearch - authorized user types the concret URL of a "hidden" collection""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?ln=en&cc=ISOLDE+Internal+Notes&sc=1&p=&f=&action_search=Search', username='dorian', password='d123orian', expected_text=['ISOLDE Internal Notes', '[CERN-PS-PA-Note-93-04]'], unexpected_text=['This collection is restricted.']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_search_for_pattern_from_the_top_as_unauthorized_user(self): """websearch - unauthorized user searches for a pattern from the top""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?ln=en&sc=1&p=of&f=&action_search=Search&c=Articles+%26+Preprints&c=Books+%26+Reports&c=Multimedia+%26+Arts', username='guest', expected_text=['Articles & Preprints', '61', 'records found', 'Books & Reports', '2', 'records found', 'Multimedia & Arts', '14', 'records found']) if error_messages: self.fail(merge_error_messages(error_messages)) def 
test_search_for_pattern_from_the_top_as_authorized_user(self): """websearch - authorized user searches for a pattern from the top""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?ln=en&sc=1&p=of&f=&action_search=Search&c=Articles+%26+Preprints&c=Books+%26+Reports&c=Multimedia+%26+Arts', username='admin', password='', expected_text=['Articles & Preprints', '61', 'records found', 'Books & Reports', '6', 'records found', 'Multimedia & Arts', '14', 'records found', 'ALEPH Theses', '1', 'records found', 'ALEPH Internal Notes', '1', 'records found']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_search_for_pattern_from_an_specific_coll_as_unauthorized_user(self): """websearch - unauthorized user searches for a pattern from one specific collection""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?ln=en&cc=Books+%26+Reports&sc=1&p=of&f=&action_search=Search&c=Books&c=Reports', username='guest', expected_text=['Books', '1', 'records found', 'Reports', '1', 'records found']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_search_for_pattern_from_an_specific_coll_as_authorized_user(self): """websearch - authorized user searches for a pattern from one specific collection""" error_messages = test_web_page_content(CFG_SITE_URL + '/search?ln=en&cc=Books+%26+Reports&sc=1&p=of&f=&action_search=Search&c=Books&c=Reports', username='admin', password='', expected_text=['Books', '1', 'records found', 'Reports', '1', 'records found', 'Theses', '4', 'records found']) if error_messages: self.fail(merge_error_messages(error_messages)) class WebSearchRestrictedPicturesTest(unittest.TestCase): """ Check whether restricted pictures on the demo site can be accessed well by people who have rights to access them. """ def test_restricted_pictures_guest(self): """websearch - restricted pictures not available to guest""" error_messages = test_web_page_content(CFG_SITE_URL + '/%s/1/files/0106015_01.jpg' % CFG_SITE_RECORD, expected_text=['This file is restricted. If you think you have right to access it, please authenticate yourself.']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_restricted_pictures_romeo(self): """websearch - restricted pictures available to Romeo""" error_messages = test_web_page_content(CFG_SITE_URL + '/%s/1/files/0106015_01.jpg' % CFG_SITE_RECORD, username='romeo', password='r123omeo', expected_text=[], unexpected_text=['This file is restricted', 'You are not authorized']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_restricted_pictures_hyde(self): """websearch - restricted pictures not available to Mr. Hyde""" error_messages = test_web_page_content(CFG_SITE_URL + '/%s/1/files/0106015_01.jpg' % CFG_SITE_RECORD, username='hyde', password='h123yde', expected_text=['This file is restricted', 'You are not authorized']) if error_messages: self.failUnless("HTTP Error 401: Unauthorized" in merge_error_messages(error_messages)) class WebSearchRestrictedWebJournalFilesTest(unittest.TestCase): """ Check whether files attached to a WebJournal article are well accessible when the article is published """ def test_restricted_files_guest(self): """websearch - files of unreleased articles are not available to guest""" # Record is not public... self.assertEqual(record_public_p(112), False) # ... 
and guest cannot access attached files error_messages = test_web_page_content(CFG_SITE_URL + '/%s/112/files/journal_galapagos_archipelago.jpg' % CFG_SITE_RECORD, expected_text=['This file is restricted. If you think you have right to access it, please authenticate yourself.']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_restricted_files_editor(self): """websearch - files of unreleased articles are available to editor""" # Record is not public... self.assertEqual(record_public_p(112), False) # ... but editor can access attached files error_messages = test_web_page_content(CFG_SITE_URL + '/%s/112/files/journal_galapagos_archipelago.jpg' % CFG_SITE_RECORD, username='balthasar', password='b123althasar', expected_text=[], unexpected_text=['This file is restricted', 'You are not authorized']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_public_files_guest(self): """websearch - files of released articles are available to guest""" # Record is not public... self.assertEqual(record_public_p(111), False) # ... but user can access attached files, as article is released error_messages = test_web_page_content(CFG_SITE_URL + '/%s/111/files/journal_scissor_beak.jpg' % CFG_SITE_RECORD, expected_text=[], unexpected_text=['This file is restricted', 'You are not authorized']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_really_restricted_files_guest(self): """websearch - restricted files of released articles are not available to guest""" # Record is not public... self.assertEqual(record_public_p(111), False) # ... and user cannot access restricted attachments, even if # article is released error_messages = test_web_page_content(CFG_SITE_URL + '/%s/111/files/restricted-journal_scissor_beak.jpg' % CFG_SITE_RECORD, expected_text=['This file is restricted.
If you think you have right to access it, please authenticate yourself.']) if error_messages: self.fail(merge_error_messages(error_messages)) def test_restricted_picture_has_restriction_flag(self): """websearch - restricted files displays a restriction flag""" error_messages = test_web_page_content(CFG_SITE_URL + '/%s/1/files/' % CFG_SITE_RECORD, expected_text="Restricted") if error_messages: self.fail(merge_error_messages(error_messages)) class WebSearchRSSFeedServiceTest(unittest.TestCase): """Test of the RSS feed service.""" def test_rss_feed_service(self): """websearch - RSS feed service""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/rss', expected_text=' -1: self.fail("Oops, when split by collection is off, " "results overview should not be present.") if body.find('') == -1: self.fail("Oops, when split by collection is off, " "Atlantis collection should be found.") if body.find('') > -1: self.fail("Oops, when split by collection is off, " "Multimedia & Arts should not be found.") try: browser.find_link(url='#15') self.fail("Oops, when split by collection is off, " "a link to Multimedia & Arts should not be found.") except LinkNotFoundError: pass def test_results_overview_split_on(self): """websearch - results overview box when split by collection is on""" browser = Browser() browser.open(CFG_SITE_URL + '/search?p=of&sc=1') body = browser.response().read() if body.find("Results overview") == -1: self.fail("Oops, when split by collection is on, " "results overview should be present.") if body.find('') > -1: self.fail("Oops, when split by collection is on, " "Atlantis collection should not be found.") if body.find('') == -1: self.fail("Oops, when split by collection is on, " "Multimedia & Arts should be found.") try: browser.find_link(url='#15') except LinkNotFoundError: self.fail("Oops, when split by collection is on, " "a link to Multimedia & Arts should be found.") class WebSearchSortResultsTest(unittest.TestCase): """Test of the search results page's sorting capability.""" def test_sort_results_default(self): """websearch - search results sorting, default method""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=of&f=title&rg=1', expected_text="CMS animation of the high-energy collisions")) def test_sort_results_ascending(self): """websearch - search results sorting, ascending field""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=of&f=title&rg=2&sf=reportnumber&so=a', expected_text="[astro-ph/0104076]")) def test_sort_results_descending(self): """websearch - search results sorting, descending field""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=of&f=title&rg=1&sf=reportnumber&so=d', expected_text=" [TESLA-FEL-99-07]")) def test_sort_results_sort_pattern(self): """websearch - search results sorting, preferential sort pattern""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=of&f=title&rg=1&sf=reportnumber&so=d&sp=cern', expected_text="[CERN-TH-2002-069]")) class WebSearchSearchResultsXML(unittest.TestCase): """Test search results in various output""" def test_search_results_xm_output_split_on(self): """ websearch - check document element of search results in xm output (split by collection on)""" browser = Browser() browser.open(CFG_SITE_URL + '/search?sc=1&of=xm') body = browser.response().read() num_doc_element = body.count("") if num_doc_element == 0: self.fail("Oops, no document element " "found in search results.") elif num_doc_element > 1: self.fail("Oops, multiple document 
elements " "found in search results.") num_doc_element = body.count("") if num_doc_element == 0: self.fail("Oops, no document element " "found in search results.") elif num_doc_element > 1: self.fail("Oops, multiple document elements " "found in search results.") def test_search_results_xm_output_split_off(self): """ websearch - check document element of search results in xm output (split by collection off)""" browser = Browser() browser.open(CFG_SITE_URL + '/search?sc=0&of=xm') body = browser.response().read() num_doc_element = body.count("") if num_doc_element == 0: self.fail("Oops, no document element " "found in search results.") elif num_doc_element > 1: self.fail("Oops, multiple document elements " "found in search results.") num_doc_element = body.count("") if num_doc_element == 0: self.fail("Oops, no document element " "found in search results.") elif num_doc_element > 1: self.fail("Oops, multiple document elements " "found in search results.") def test_search_results_xd_output_split_on(self): """ websearch - check document element of search results in xd output (split by collection on)""" browser = Browser() browser.open(CFG_SITE_URL + '/search?sc=1&of=xd') body = browser.response().read() num_doc_element = body.count("" "found in search results.") elif num_doc_element > 1: self.fail("Oops, multiple document elements " "found in search results.") num_doc_element = body.count("") if num_doc_element == 0: self.fail("Oops, no document element " "found in search results.") elif num_doc_element > 1: self.fail("Oops, multiple document elements " "found in search results.") def test_search_results_xd_output_split_off(self): """ websearch - check document element of search results in xd output (split by collection off)""" browser = Browser() browser.open(CFG_SITE_URL + '/search?sc=0&of=xd') body = browser.response().read() num_doc_element = body.count("") if num_doc_element == 0: self.fail("Oops, no document element " "found in search results.") elif num_doc_element > 1: self.fail("Oops, multiple document elements " "found in search results.") num_doc_element = body.count("") if num_doc_element == 0: self.fail("Oops, no document element " "found in search results.") elif num_doc_element > 1: self.fail("Oops, multiple document elements " "found in search results.") class WebSearchUnicodeQueryTest(unittest.TestCase): """Test of the search results for queries containing Unicode characters.""" def test_unicode_word_query(self): """websearch - Unicode word query""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?of=id&p=title%3A%CE%99%CE%B8%CE%AC%CE%BA%CE%B7', expected_text="[76]")) def test_unicode_word_query_not_found_term(self): """websearch - Unicode word query, not found term""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=title%3A%CE%99%CE%B8', expected_text="ιθάκη")) def test_unicode_exact_phrase_query(self): """websearch - Unicode exact phrase query""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?of=id&p=title%3A%22%CE%99%CE%B8%CE%AC%CE%BA%CE%B7%22', expected_text="[76]")) def test_unicode_partial_phrase_query(self): """websearch - Unicode partial phrase query""" # no hit here for example title partial phrase query due to # removed difference between double-quoted and single-quoted # search: self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?of=id&p=title%3A%27%CE%B7%27', expected_text="[]")) def test_unicode_regexp_query(self): """websearch - Unicode regexp query""" self.assertEqual([], 
test_web_page_content(CFG_SITE_URL + '/search?of=id&p=title%3A%2F%CE%B7%2F', expected_text="[76]")) class WebSearchMARCQueryTest(unittest.TestCase): """Test of the search results for queries containing physical MARC tags.""" def test_single_marc_tag_exact_phrase_query(self): """websearch - single MARC tag, exact phrase query (100__a)""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?of=id&p=100__a%3A%22Ellis%2C+J%22', expected_text="[9, 14, 18]")) def test_single_marc_tag_partial_phrase_query(self): """websearch - single MARC tag, partial phrase query (245__b)""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?of=id&p=245__b%3A%27and%27', expected_text="[28]")) def test_many_marc_tags_partial_phrase_query(self): """websearch - many MARC tags, partial phrase query (245)""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?of=id&p=245%3A%27and%27&rg=100', expected_text="[1, 8, 9, 14, 15, 20, 22, 24, 28, 33, 47, 48, 49, 51, 53, 64, 69, 71, 79, 82, 83, 85, 91, 96, 108]")) def test_single_marc_tag_regexp_query(self): """websearch - single MARC tag, regexp query""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?of=id&p=245%3A%2Fand%2F&rg=100', expected_text="[1, 8, 9, 14, 15, 20, 22, 24, 28, 33, 47, 48, 49, 51, 53, 64, 69, 71, 79, 82, 83, 85, 91, 96, 108]")) class WebSearchExtSysnoQueryTest(unittest.TestCase): """Test of queries using external system numbers.""" def test_existing_sysno_html_output(self): """websearch - external sysno query, existing sysno, HTML output""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?sysno=000289446CER', expected_text="The wall of the cave")) def test_existing_sysno_id_output(self): """websearch - external sysno query, existing sysno, ID output""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?sysno=000289446CER&of=id', expected_text="[95]")) def test_nonexisting_sysno_html_output(self): """websearch - external sysno query, non-existing sysno, HTML output""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?sysno=000289446CERRRR', expected_text="Requested record does not seem to exist.")) def test_nonexisting_sysno_id_output(self): """websearch - external sysno query, non-existing sysno, ID output""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?sysno=000289446CERRRR&of=id', expected_text="[]")) class WebSearchResultsRecordGroupingTest(unittest.TestCase): """Test search results page record grouping (rg).""" def test_search_results_rg_guest(self): """websearch - search results, records in groups of, guest""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?rg=17', expected_text="1 - 17")) def test_search_results_rg_nonguest(self): """websearch - search results, records in groups of, non-guest""" # This test used to fail due to saved user preference fetching # not overridden by URL rg argument. 
self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?rg=17', username='admin', expected_text="1 - 17")) class WebSearchSpecialTermsQueryTest(unittest.TestCase): """Test of the search results for queries containing special terms.""" def test_special_terms_u1(self): """websearch - query for special terms, U(1)""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?of=id&p=U%281%29', expected_text="[57, 79, 80, 88]")) def test_special_terms_u1_and_sl(self): """websearch - query for special terms, U(1) SL(2,Z)""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?of=id&p=U%281%29+SL%282%2CZ%29', expected_text="[88]")) def test_special_terms_u1_and_sl_or(self): """websearch - query for special terms, U(1) OR SL(2,Z)""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?of=id&p=U%281%29+OR+SL%282%2CZ%29', expected_text="[57, 79, 80, 88]")) @nottest def FIXME_TICKET_453_test_special_terms_u1_and_sl_or_parens(self): """websearch - query for special terms, (U(1) OR SL(2,Z))""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?of=id&p=%28U%281%29+OR+SL%282%2CZ%29%29', expected_text="[57, 79, 80, 88]")) def test_special_terms_u1_and_sl_in_quotes(self): """websearch - query for special terms, ('SL(2,Z)' OR 'U(1)')""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + "/search?of=id&p=%28%27SL%282%2CZ%29%27+OR+%27U%281%29%27%29", expected_text="[57, 79, 80, 88, 96]")) class WebSearchJournalQueryTest(unittest.TestCase): """Test of the search results for journal pubinfo queries.""" def test_query_journal_title_only(self): """websearch - journal publication info query, title only""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?of=id&f=journal&p=Phys.+Lett.+B', expected_text="[78, 85, 87]")) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?of=id&f=journal&p=Phys.+Lett.+B', username='admin', expected_text="[77, 78, 85, 87]")) def test_query_journal_full_pubinfo(self): """websearch - journal publication info query, full reference""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?of=id&f=journal&p=Phys.+Lett.+B+531+%282002%29+301', expected_text="[78]")) class WebSearchStemmedIndexQueryTest(unittest.TestCase): """Test of the search results for queries using stemmed indexes.""" def test_query_stemmed_lowercase(self): """websearch - stemmed index query, lowercase""" # note that dasse/Dasse is stemmed into dass/Dass, as expected self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?of=id&p=dasse', expected_text="[25, 26]")) def test_query_stemmed_uppercase(self): """websearch - stemmed index query, uppercase""" # ... but note also that DASSE is stemmed into DASSE(!); so # the test would fail if the search engine would not lower the # query term. (Something that is not necessary for # non-stemmed indexes.) 
self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?of=id&p=DASSE', expected_text="[25, 26]")) class WebSearchSummarizerTest(unittest.TestCase): """Test of the search results summarizer functions.""" def test_most_popular_field_values_singletag(self): """websearch - most popular field values, simple tag""" from invenio.search_engine import get_most_popular_field_values self.assertEqual([('PREPRINT', 37), ('ARTICLE', 28), ('BOOK', 14), ('THESIS', 8), ('PICTURE', 7), ('DRAFT', 2), ('POETRY', 2), ('REPORT', 2), ('ALEPHPAPER', 1), ('ATLANTISTIMESNEWS', 1), ('ISOLDEPAPER', 1)], get_most_popular_field_values(range(0,100), '980__a')) def test_most_popular_field_values_singletag_multiexclusion(self): """websearch - most popular field values, simple tag, multiple exclusions""" from invenio.search_engine import get_most_popular_field_values self.assertEqual([('PREPRINT', 37), ('ARTICLE', 28), ('BOOK', 14), ('DRAFT', 2), ('REPORT', 2), ('ALEPHPAPER', 1), ('ATLANTISTIMESNEWS', 1), ('ISOLDEPAPER', 1)], get_most_popular_field_values(range(0,100), '980__a', ('THESIS', 'PICTURE', 'POETRY'))) def test_most_popular_field_values_multitag(self): """websearch - most popular field values, multiple tags""" from invenio.search_engine import get_most_popular_field_values self.assertEqual([('Ellis, J', 3), ('Enqvist, K', 1), ('Ibanez, L E', 1), ('Nanopoulos, D V', 1), ('Ross, G G', 1)], get_most_popular_field_values((9, 14, 18), ('100__a', '700__a'))) def test_most_popular_field_values_multitag_singleexclusion(self): """websearch - most popular field values, multiple tags, single exclusion""" from invenio.search_engine import get_most_popular_field_values self.assertEqual([('Enqvist, K', 1), ('Ibanez, L E', 1), ('Nanopoulos, D V', 1), ('Ross, G G', 1)], get_most_popular_field_values((9, 14, 18), ('100__a', '700__a'), ('Ellis, J'))) def test_most_popular_field_values_multitag_countrepetitive(self): """websearch - most popular field values, multiple tags, counting repetitive occurrences""" from invenio.search_engine import get_most_popular_field_values self.assertEqual([('THESIS', 2), ('REPORT', 1)], get_most_popular_field_values((41,), ('690C_a', '980__a'), count_repetitive_values=True)) self.assertEqual([('REPORT', 1), ('THESIS', 1)], get_most_popular_field_values((41,), ('690C_a', '980__a'), count_repetitive_values=False)) def test_ellis_citation_summary(self): """websearch - query ellis, citation summary output format""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=ellis&of=hcs', expected_text="Less known papers (1-9)", expected_link_target=CFG_SITE_URL+"/search?p=ellis%20AND%20cited%3A1-%3E9", expected_link_label='1')) def test_ellis_not_quark_citation_summary_advanced(self): """websearch - ellis and not quark, citation summary format advanced""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?ln=en&as=1&m1=a&p1=ellis&f1=author&op1=n&m2=a&p2=quark&f2=&op2=a&m3=a&p3=&f3=&action_search=Search&sf=&so=a&rm=&rg=10&sc=1&of=hcs', expected_text="Less known papers (1-9)", expected_link_target=CFG_SITE_URL+'/search?p=author%3Aellis%20and%20not%20quark%20AND%20cited%3A1-%3E9', expected_link_label='1')) def test_ellis_not_quark_citation_summary_regular(self): """websearch - ellis and not quark, citation summary format advanced""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?ln=en&p=author%3Aellis+and+not+quark&f=&action_search=Search&sf=&so=d&rm=&rg=10&sc=0&of=hcs', expected_text="Less known papers (1-9)", 
expected_link_target=CFG_SITE_URL+'/search?p=author%3Aellis%20and%20not%20quark%20AND%20cited%3A1-%3E9', expected_link_label='1')) class WebSearchRecordCollectionGuessTest(unittest.TestCase): """Primary collection guessing tests.""" def test_guess_primary_collection_of_a_record(self): """websearch - guess_primary_collection_of_a_record""" self.assertEqual(guess_primary_collection_of_a_record(96), 'Articles') def test_guess_collection_of_a_record(self): """websearch - guess_collection_of_a_record""" self.assertEqual(guess_collection_of_a_record(96), 'Articles') self.assertEqual(guess_collection_of_a_record(96, '%s/collection/Theoretical Physics (TH)?ln=en' % CFG_SITE_URL), 'Articles') self.assertEqual(guess_collection_of_a_record(12, '%s/collection/Theoretical Physics (TH)?ln=en' % CFG_SITE_URL), 'Theoretical Physics (TH)') self.assertEqual(guess_collection_of_a_record(12, '%s/collection/Theoretical%%20Physics%%20%%28TH%%29?ln=en' % CFG_SITE_URL), 'Theoretical Physics (TH)') class WebSearchGetFieldValuesTest(unittest.TestCase): """Testing get_fieldvalues() function.""" def test_get_fieldvalues_001(self): """websearch - get_fieldvalues() for bibxxx-agnostic tags""" self.assertEqual(get_fieldvalues(10, '001___'), ['10']) def test_get_fieldvalues_980(self): """websearch - get_fieldvalues() for bibxxx-powered tags""" self.assertEqual(get_fieldvalues(18, '700__a'), ['Enqvist, K', 'Nanopoulos, D V']) self.assertEqual(get_fieldvalues(18, '909C1u'), ['CERN']) def test_get_fieldvalues_wildcard(self): """websearch - get_fieldvalues() for tag wildcards""" self.assertEqual(get_fieldvalues(18, '%'), []) self.assertEqual(get_fieldvalues(18, '7%'), []) self.assertEqual(get_fieldvalues(18, '700%'), ['Enqvist, K', 'Nanopoulos, D V']) self.assertEqual(get_fieldvalues(18, '909C0%'), ['1985', '13','TH']) def test_get_fieldvalues_recIDs(self): """websearch - get_fieldvalues() for list of recIDs""" self.assertEqual(get_fieldvalues([], '001___'), []) self.assertEqual(get_fieldvalues([], '700__a'), []) self.assertEqual(get_fieldvalues([10, 13], '001___'), ['10', '13']) self.assertEqual(get_fieldvalues([18, 13], '700__a'), ['Dawson, S', 'Ellis, R K', 'Enqvist, K', 'Nanopoulos, D V']) def test_get_fieldvalues_repetitive(self): """websearch - get_fieldvalues() for repetitive values""" self.assertEqual(get_fieldvalues([17, 18], '909C1u'), ['CERN', 'CERN']) self.assertEqual(get_fieldvalues([17, 18], '909C1u', repetitive_values=True), ['CERN', 'CERN']) self.assertEqual(get_fieldvalues([17, 18], '909C1u', repetitive_values=False), ['CERN']) class WebSearchAddToBasketTest(unittest.TestCase): """Test of the add-to-basket presence depending on user rights.""" def test_add_to_basket_guest(self): """websearch - add-to-basket facility allowed for guests""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=recid%3A10', expected_text='Add to basket')) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=recid%3A10', expected_text='')) def test_add_to_basket_jekyll(self): """websearch - add-to-basket facility allowed for Dr. Jekyll""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=recid%3A10', expected_text='Add to basket', username='jekyll', password='j123ekyll')) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=recid%3A10', expected_text='', username='jekyll', password='j123ekyll')) def test_add_to_basket_hyde(self): """websearch - add-to-basket facility denied to Mr. 
Hyde""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=recid%3A10', unexpected_text='Add to basket', username='hyde', password='h123yde')) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=recid%3A10', unexpected_text='', username='hyde', password='h123yde')) class WebSearchAlertTeaserTest(unittest.TestCase): """Test of the alert teaser presence depending on user rights.""" def test_alert_teaser_guest(self): """websearch - alert teaser allowed for guests""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=ellis', expected_link_label='email alert')) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=ellis', expected_text='RSS feed')) def test_alert_teaser_jekyll(self): """websearch - alert teaser allowed for Dr. Jekyll""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=ellis', expected_text='email alert', username='jekyll', password='j123ekyll')) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=ellis', expected_text='RSS feed', username='jekyll', password='j123ekyll')) def test_alert_teaser_hyde(self): """websearch - alert teaser allowed for Mr. Hyde""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=ellis', expected_text='email alert', username='hyde', password='h123yde')) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=ellis', expected_text='RSS feed', username='hyde', password='h123yde')) class WebSearchSpanQueryTest(unittest.TestCase): """Test of span queries.""" def test_span_in_word_index(self): """websearch - span query in a word index""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=year%3A1992-%3E1996&of=id&ap=0', expected_text='[17, 66, 69, 71]')) def test_span_in_phrase_index(self): """websearch - span query in a phrase index""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=year%3A%221992%22-%3E%221996%22&of=id&ap=0', expected_text='[17, 66, 69, 71]')) def test_span_in_bibxxx(self): """websearch - span query in MARC tables""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=909C0y%3A%221992%22-%3E%221996%22&of=id&ap=0', expected_text='[17, 66, 69, 71]')) def test_span_with_spaces(self): """websearch - no span query when a space is around""" # useful for reaction search self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=title%3A%27mu%20--%3E%20e%27&of=id&ap=0', expected_text='[67]')) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=245%3A%27mu%20--%3E%20e%27&of=id&ap=0', expected_text='[67]')) def test_span_in_author(self): """websearch - span query in special author index""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=author%3A%22Ellis,%20K%22-%3E%22Ellis,%20RZ%22&of=id&ap=0', expected_text='[8, 11, 13, 17, 47]')) class WebSearchReferstoCitedbyTest(unittest.TestCase): """Test of refersto/citedby search operators.""" def test_refersto_recid(self): 'websearch - refersto:recid:84' self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=refersto%3Arecid%3A84&of=id&ap=0', expected_text='[85, 88, 91]')) def test_refersto_repno(self): 'websearch - refersto:reportnumber:hep-th/0205061' self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=refersto%3Areportnumber%3Ahep-th/0205061&of=id&ap=0', expected_text='[91]')) def test_refersto_author_word(self): 'websearch - refersto:author:klebanov' self.assertEqual([], test_web_page_content(CFG_SITE_URL + 
'/search?p=refersto%3Aauthor%3Aklebanov&of=id&ap=0', expected_text='[85, 86, 88, 91]')) def test_refersto_author_phrase(self): 'websearch - refersto:author:"Klebanov, I"' self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=refersto%3Aauthor%3A%22Klebanov,%20I%22&of=id&ap=0', expected_text='[85, 86, 88, 91]')) def test_citedby_recid(self): 'websearch - citedby:recid:92' self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=citedby%3Arecid%3A92&of=id&ap=0', expected_text='[74, 91]')) def test_citedby_repno(self): 'websearch - citedby:reportnumber:hep-th/0205061' self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=citedby%3Areportnumber%3Ahep-th/0205061&of=id&ap=0', expected_text='[78]')) def test_citedby_author_word(self): 'websearch - citedby:author:klebanov' self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=citedby%3Aauthor%3Aklebanov&of=id&ap=0', expected_text='[95]')) def test_citedby_author_phrase(self): 'websearch - citedby:author:"Klebanov, I"' self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=citedby%3Aauthor%3A%22Klebanov,%20I%22&of=id&ap=0', expected_text='[95]')) def test_refersto_bad_query(self): 'websearch - refersto:title:' self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=refersto%3Atitle%3A', expected_text='There are no records referring to title:.')) def test_citedby_bad_query(self): 'websearch - citedby:title:' self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=citedby%3Atitle%3A', expected_text='There are no records cited by title:.')) class WebSearchSPIRESSyntaxTest(unittest.TestCase): """Test of SPIRES syntax issues""" if CFG_WEBSEARCH_SPIRES_SYNTAX > 0: def test_and_not_parens(self): 'websearch - find a ellis, j and not a enqvist' self.assertEqual([], test_web_page_content(CFG_SITE_URL +'/search?p=find+a+ellis%2C+j+and+not+a+enqvist&of=id&ap=0', expected_text='[9, 12, 14, 47]')) + if DATEUTIL_AVAILABLE: def test_dadd_search(self): 'websearch - find da > today - 3650' # XXX: assumes we've reinstalled our site in the last 10 years # should return every document in the system self.assertEqual([], test_web_page_content(CFG_SITE_URL +'/search?ln=en&p=find+da+%3E+today+-+3650&f=&of=id', expected_text='[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 99, 100, 101, 102, 103, 104, 107, 108, 113]')) class WebSearchDateQueryTest(unittest.TestCase): """Test various date queries.""" def setUp(self): """Establish variables we plan to re-use""" self.empty = intbitset() def test_search_unit_hits_for_datecreated_previous_millenia(self): """websearch - search_unit with datecreated returns >0 hits for docs in the last 1000 years""" self.assertNotEqual(self.empty, search_unit('1000-01-01->9999-12-31', 'datecreated')) def test_search_unit_hits_for_datemodified_previous_millenia(self): """websearch - search_unit with datemodified returns >0 hits for docs in the last 1000 years""" self.assertNotEqual(self.empty, search_unit('1000-01-01->9999-12-31', 'datemodified')) def test_search_unit_in_bibrec_for_datecreated_previous_millenia(self): """websearch - search_unit_in_bibrec with creationdate gets >0 hits for past 1000 years""" self.assertNotEqual(self.empty, 
search_unit_in_bibrec("1000-01-01", "9999-12-31", 'creationdate')) def test_search_unit_in_bibrec_for_datecreated_next_millenia(self): """websearch - search_unit_in_bibrec with creationdate gets 0 hits for after year 3000""" self.assertEqual(self.empty, search_unit_in_bibrec("3000-01-01", "9999-12-31", 'creationdate')) class WebSearchSynonymQueryTest(unittest.TestCase): """Test of queries using synonyms.""" def test_journal_phrvd(self): """websearch - search-time synonym search, journal title""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=PHRVD&f=journal&of=id', expected_text="[66, 72]")) def test_journal_phrvd_54_1996_4234(self): """websearch - search-time synonym search, journal article""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=PHRVD%2054%20%281996%29%204234&f=journal&of=id', expected_text="[66]")) def test_journal_beta_decay_title(self): """websearch - index-time synonym search, beta decay in title""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=beta+decay&f=title&of=id', expected_text="[59]")) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=%CE%B2+decay&f=title&of=id', expected_text="[59]")) def test_journal_beta_decay_global(self): """websearch - index-time synonym search, beta decay in any field""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=beta+decay&of=id', expected_text="[52, 59]")) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=%CE%B2+decay&of=id', expected_text="[52, 59]")) def test_journal_beta_title(self): """websearch - index-time synonym search, beta in title""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=beta&f=title&of=id', expected_text="[59]")) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=%CE%B2&f=title&of=id', expected_text="[59]")) def test_journal_beta_global(self): """websearch - index-time synonym search, beta in any field""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=beta&of=id', expected_text="[52, 59]")) self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=%CE%B2&of=id', expected_text="[52, 59]")) class WebSearchWashCollectionsTest(unittest.TestCase): """Test if the collection argument is washed correctly""" def test_wash_coll_when_coll_restricted(self): """websearch - washing of restricted daughter collections""" self.assertEqual( sorted(wash_colls(cc='', c=['Books & Reports', 'Theses'])[1]), ['Books & Reports', 'Theses']) self.assertEqual( sorted(wash_colls(cc='', c=['Books & Reports', 'Theses'])[2]), ['Books & Reports', 'Theses']) class WebSearchAuthorCountQueryTest(unittest.TestCase): """Test of queries using authorcount fields.""" def test_journal_authorcount_word(self): """websearch - author count, word query""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=4&f=authorcount&of=id', expected_text="[51, 54, 59, 66, 92, 96]")) def test_journal_authorcount_phrase(self): """websearch - author count, phrase query""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=%224%22&f=authorcount&of=id', expected_text="[51, 54, 59, 66, 92, 96]")) def test_journal_authorcount_span(self): """websearch - author count, span query""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + '/search?p=authorcount%3A9-%3E16&of=id', expected_text="[69, 71]")) def test_journal_authorcount_plus(self): """websearch - author count, plus query""" self.assertEqual([], test_web_page_content(CFG_SITE_URL + 
'/search?p=50%2B&f=authorcount&of=id', expected_text="[10, 17]")) class WebSearchPerformRequestSearchRefactoringTest(unittest.TestCase): """Tests the perform request search API after refactoring.""" def _run_test(self, test_args, expected_results): params = {} params.update(map(lambda y: (y[0], ',' in y[1] and ', ' not in y[1] and y[1].split(',') or y[1]), map(lambda x: x.split('=', 1), test_args.split(';')))) #params.update(map(lambda x: x.split('=', 1), test_args.split(';'))) req = cStringIO.StringIO() params['req'] = req recs = perform_request_search(**params) if isinstance(expected_results, str): req.seek(0) recs = req.read() # this is just used to generate the results from the seearch engine before refactoring #if recs != expected_results: # print test_args # print params # print recs self.assertEqual(recs, expected_results, "Error, we expect: %s, and we received: %s" % (expected_results, recs)) def test_queries(self): """websearch - testing p_r_s standard arguments and their combinations""" self._run_test('p=ellis;f=author;action=Search', [8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 47]) self._run_test('p=ellis;f=author;sf=title;action=Search', [8, 16, 14, 9, 11, 17, 18, 12, 10, 47, 13]) self._run_test('p=ellis;f=author;sf=title;wl=5;action=Search', [8, 16, 14, 9, 11, 17, 18, 12, 10, 47, 13]) self._run_test('p=ellis;f=author;sf=title;wl=5;so=a', [13, 47, 10, 12, 18, 17, 11, 9, 14, 16, 8]) self._run_test('p=ellis;f=author;sf=title;wl=5;so=d', [8, 16, 14, 9, 11, 17, 18, 12, 10, 47, 13]) self._run_test('p=ell*;sf=title;wl=5', [8, 15, 16, 14, 9, 11, 17, 18, 12, 10, 47, 13]) self._run_test('p=ell*;sf=title;wl=1', [10]) self._run_test('p=ell*;sf=title;wl=100', [8, 15, 16, 14, 9, 11, 17, 18, 12, 10, 47, 13]) self._run_test('p=muon OR kaon;f=author;sf=title;wl=5;action=Search', []) self._run_test('p=muon OR kaon;sf=title;wl=5;action=Search', [67, 12]) self._run_test('p=muon OR kaon;sf=title;wl=5;c=Articles,Preprints', [67, 12]) self._run_test('p=muon OR kaon;sf=title;wl=5;c=Articles', [67]) self._run_test('p=muon OR kaon;sf=title;wl=5;c=Preprints', [12]) # FIXME_TICKET_1174 # self._run_test('p=el*;rm=citation', [2, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 23, 30, 32, 34, 47, 48, 51, 52, 54, 56, 58, 59, 92, 97, 100, 103, 18, 74, 91, 94, 81]) if not get_external_word_similarity_ranker(): self._run_test('p=el*;rm=wrd', [2, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 23, 30, 32, 34, 47, 48, 51, 52, 54, 56, 58, 59, 74, 81, 91, 92, 94, 97, 100, 103, 109]) self._run_test('p=el*;sf=title', [100, 32, 8, 15, 16, 81, 97, 34, 23, 58, 2, 14, 9, 11, 30, 109, 52, 48, 94, 17, 56, 18, 91, 59, 12, 92, 74, 54, 103, 10, 51, 47, 13]) self._run_test('p=boson;rm=citation', [1, 47, 50, 107, 108, 77, 95]) if not get_external_word_similarity_ranker(): self._run_test('p=boson;rm=wrd', [108, 77, 47, 50, 95, 1, 107]) self._run_test('p1=ellis;f1=author;m1=a;op1=a;p2=john;f2=author;m2=a', []) self._run_test('p1=ellis;f1=author;m1=o;op1=a;p2=john;f2=author;m2=o', []) self._run_test('p1=ellis;f1=author;m1=e;op1=a;p2=john;f2=author;m2=e', []) self._run_test('p1=ellis;f1=author;m1=a;op1=o;p2=john;f2=author;m2=a', [8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 47]) self._run_test('p1=ellis;f1=author;m1=o;op1=o;p2=john;f2=author;m2=o', [8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 47]) self._run_test('p1=ellis;f1=author;m1=e;op1=o;p2=john;f2=author;m2=e', []) self._run_test('p1=ellis;f1=author;m1=a;op1=n;p2=john;f2=author;m2=a', [8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 47]) self._run_test('p1=ellis;f1=author;m1=o;op1=n;p2=john;f2=author;m2=o', [8, 9, 10, 
11, 12, 13, 14, 16, 17, 18, 47]) self._run_test('p1=ellis;f1=author;m1=e;op1=n;p2=john;f2=author;m2=e', []) self._run_test('p=Ellis, J;ap=1', [9, 10, 11, 12, 14, 17, 18, 47]) self._run_test('p=Ellis, J;ap=0', [9, 10, 11, 12, 14, 17, 18, 47]) self._run_test('p=recid:148x', []) self._run_test('p=recid:148x;of=xm;rg=200', "\n\n") class WebSearchGetRecordTests(unittest.TestCase): def setUp(self): self.recid = run_sql("INSERT INTO bibrec(creation_date, modification_date) VALUES(NOW(), NOW())") def tearDown(self): run_sql("DELETE FROM bibrec WHERE id=%s", (self.recid,)) def test_get_record(self): """bibformat - test print_record and get_record of empty record""" from invenio.search_engine import print_record, get_record self.assertEqual(print_record(self.recid, 'xm'), ' \n %s\n \n\n ' % self.recid) self.assertEqual(get_record(self.recid), {'001': [([], ' ', ' ', str(self.recid), 1)]}) class WebSearchExactTitleIndexTest(unittest.TestCase): """Checks if exact title index works correctly """ def test_exacttitle_query_solves_problems(self): """websearch - check exacttitle query solves problems""" error_messages = [] error_messages.extend(test_web_page_content(CFG_SITE_URL + "/search?ln=en&p=exacttitle%3A'solves+problems'&f=&action_search=Search", expected_text = "Non-compact supergravity solves problems")) if error_messages: self.fail(merge_error_messages(error_messages)) def test_exacttitle_query_solve_problems(self): """websearch - check exacttitle query solve problems""" error_messages = [] error_messages.extend(test_web_page_content(CFG_SITE_URL + "/search?ln=en&p=exacttitle%3A'solve+problems'&f=&action_search=Search", expected_text = ['Search term', 'solve problems', 'did not match'])) if error_messages: self.fail(merge_error_messages(error_messages)) def test_exacttitle_query_photon_beam(self): """websearch - check exacttitle search photon beam""" error_messages = [] error_messages.extend(test_web_page_content(CFG_SITE_URL + "/search?ln=en&p=exacttitle%3A'photon+beam'&f=&action_search=Search", expected_text = "Development of photon beam diagnostics")) if error_messages: self.fail(merge_error_messages(error_messages)) def test_exacttitle_query_photons_beam(self): """websearch - check exacttitle search photons beam""" error_messages = [] error_messages.extend(test_web_page_content(CFG_SITE_URL + "/search?ln=en&p=exacttitle%3A'photons+beam'&f=&action_search=Search", expected_text = ['Search term', 'photons beam', 'did not match'])) if error_messages: self.fail(merge_error_messages(error_messages)) TEST_SUITE = make_test_suite(WebSearchWebPagesAvailabilityTest, WebSearchTestSearch, WebSearchTestBrowse, WebSearchTestOpenURL, WebSearchTestCollections, WebSearchTestRecord, WebSearchTestLegacyURLs, WebSearchNearestTermsTest, WebSearchBooleanQueryTest, WebSearchAuthorQueryTest, WebSearchSearchEnginePythonAPITest, WebSearchSearchEngineWebAPITest, WebSearchRestrictedCollectionTest, WebSearchRestrictedCollectionHandlingTest, WebSearchRestrictedPicturesTest, WebSearchRestrictedWebJournalFilesTest, WebSearchRSSFeedServiceTest, WebSearchXSSVulnerabilityTest, WebSearchResultsOverview, WebSearchSortResultsTest, WebSearchSearchResultsXML, WebSearchUnicodeQueryTest, WebSearchMARCQueryTest, WebSearchExtSysnoQueryTest, WebSearchResultsRecordGroupingTest, WebSearchSpecialTermsQueryTest, WebSearchJournalQueryTest, WebSearchStemmedIndexQueryTest, WebSearchSummarizerTest, WebSearchRecordCollectionGuessTest, WebSearchGetFieldValuesTest, WebSearchAddToBasketTest, WebSearchAlertTeaserTest, WebSearchSpanQueryTest, 
WebSearchReferstoCitedbyTest, WebSearchSPIRESSyntaxTest, WebSearchDateQueryTest, WebSearchTestWildcardLimit, WebSearchSynonymQueryTest, WebSearchWashCollectionsTest, WebSearchAuthorCountQueryTest, WebSearchPerformRequestSearchRefactoringTest, WebSearchGetRecordTests, WebSearchExactTitleIndexTest) if __name__ == "__main__": run_test_suite(TEST_SUITE, warn_user=True) diff --git a/modules/websubmit/lib/websubmit_file_converter.py b/modules/websubmit/lib/websubmit_file_converter.py index e57752856..091ef0a50 100644 --- a/modules/websubmit/lib/websubmit_file_converter.py +++ b/modules/websubmit/lib/websubmit_file_converter.py @@ -1,1462 +1,1465 @@ # -*- coding: utf-8 -*- ## This file is part of Invenio. ## Copyright (C) 2009, 2010, 2011, 2012 CERN. ## ## Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. """ This module implement fulltext conversion between many different file formats. """ import os import stat import re import sys import shutil import tempfile import HTMLParser import time import subprocess import atexit import signal import threading from logging import DEBUG, getLogger from htmlentitydefs import entitydefs from optparse import OptionParser try: from invenio.hocrlib import create_pdf, extract_hocr, CFG_PPM_RESOLUTION - from pyPdf import PdfFileReader, PdfFileWriter + try: + from PyPDF2 import PdfFileReader, PdfFileWriter + except ImportError: + from pyPdf import PdfFileReader, PdfFileWriter CFG_CAN_DO_OCR = True except ImportError: CFG_CAN_DO_OCR = False from invenio.textutils import wrap_text_in_a_box from invenio.shellutils import run_process_with_timeout, run_shell_command from invenio.config import CFG_TMPDIR, CFG_ETCDIR, CFG_PYLIBDIR, \ CFG_PATH_ANY2DJVU, \ CFG_PATH_PDFINFO, \ CFG_PATH_GS, \ CFG_PATH_PDFOPT, \ CFG_PATH_PDFTOPS, \ CFG_PATH_GZIP, \ CFG_PATH_GUNZIP, \ CFG_PATH_PDFTOTEXT, \ CFG_PATH_PDFTOPPM, \ CFG_PATH_OCROSCRIPT, \ CFG_PATH_DJVUPS, \ CFG_PATH_DJVUTXT, \ CFG_PATH_OPENOFFICE_PYTHON, \ CFG_PATH_PSTOTEXT, \ CFG_PATH_TIFF2PDF, \ CFG_PATH_PS2PDF, \ CFG_OPENOFFICE_SERVER_HOST, \ CFG_OPENOFFICE_SERVER_PORT, \ CFG_OPENOFFICE_USER, \ CFG_PATH_CONVERT, \ CFG_PATH_PAMFILE, \ CFG_BINDIR, \ CFG_LOGDIR, \ CFG_BIBSCHED_PROCESS_USER, \ CFG_BIBDOCFILE_BEST_FORMATS_TO_EXTRACT_TEXT_FROM, \ CFG_BIBDOCFILE_DESIRED_CONVERSIONS from invenio.errorlib import register_exception def get_file_converter_logger(): return getLogger("InvenioWebSubmitFileConverterLogger") CFG_TWO2THREE_LANG_CODES = { 'en': 'eng', 'nl': 'nld', 'es': 'spa', 'de': 'deu', 'it': 'ita', 'fr': 'fra', } CFG_OPENOFFICE_TMPDIR = os.path.join(CFG_TMPDIR, 'ooffice-tmp-files') CFG_GS_MINIMAL_VERSION_FOR_PDFA = "8.65" CFG_GS_MINIMAL_VERSION_FOR_PDFX = "8.52" CFG_ICC_PATH = os.path.join(CFG_ETCDIR, 'websubmit', 'file_converter_templates', 'ISOCoatedsb.icc') CFG_PDFA_DEF_PATH = os.path.join(CFG_ETCDIR, 'websubmit', 'file_converter_templates', 'PDFA_def.ps') CFG_PDFX_DEF_PATH = 
os.path.join(CFG_ETCDIR, 'websubmit', 'file_converter_templates', 'PDFX_def.ps') CFG_UNOCONV_LOG_PATH = os.path.join(CFG_LOGDIR, 'unoconv.log') _RE_CLEAN_SPACES = re.compile(r'\s+') class InvenioWebSubmitFileConverterError(Exception): pass def get_conversion_map(): """Return a dictionary of the form: '.pdf' : {'.ps.gz' : ('pdf2ps', {param1 : value1...}) """ ret = { '.csv': {}, '.djvu': {}, '.doc': {}, '.docx': {}, '.sxw': {}, '.htm': {}, '.html': {}, '.odp': {}, '.ods': {}, '.odt': {}, '.pdf': {}, '.ppt': {}, '.pptx': {}, '.sxi': {}, '.ps': {}, '.ps.gz': {}, '.rtf': {}, '.tif': {}, '.tiff': {}, '.txt': {}, '.xls': {}, '.xlsx': {}, '.sxc': {}, '.xml': {}, '.hocr': {}, '.pdf;pdfa': {}, '.asc': {}, } if CFG_PATH_GZIP: ret['.ps']['.ps.gz'] = (gzip, {}) if CFG_PATH_GUNZIP: ret['.ps.gz']['.ps'] = (gunzip, {}) if CFG_PATH_ANY2DJVU: ret['.pdf']['.djvu'] = (any2djvu, {}) ret['.ps']['.djvu'] = (any2djvu, {}) if CFG_PATH_DJVUPS: ret['.djvu']['.ps'] = (djvu2ps, {'compress': False}) if CFG_PATH_GZIP: ret['.djvu']['.ps.gz'] = (djvu2ps, {'compress': True}) if CFG_PATH_DJVUTXT: ret['.djvu']['.txt'] = (djvu2text, {}) if CFG_PATH_PSTOTEXT: ret['.ps']['.txt'] = (pstotext, {}) if CFG_PATH_GUNZIP: ret['.ps.gz']['.txt'] = (pstotext, {}) if can_pdfa(): ret['.ps']['.pdf;pdfa'] = (ps2pdfa, {}) ret['.pdf']['.pdf;pdfa'] = (pdf2pdfa, {}) if CFG_PATH_GUNZIP: ret['.ps.gz']['.pdf;pdfa'] = (ps2pdfa, {}) else: if CFG_PATH_PS2PDF: ret['.ps']['.pdf;pdfa'] = (ps2pdf, {}) if CFG_PATH_GUNZIP: ret['.ps.gz']['.pdf'] = (ps2pdf, {}) if can_pdfx(): ret['.ps']['.pdf;pdfx'] = (ps2pdfx, {}) ret['.pdf']['.pdf;pdfx'] = (pdf2pdfx, {}) if CFG_PATH_GUNZIP: ret['.ps.gz']['.pdf;pdfx'] = (ps2pdfx, {}) if CFG_PATH_PDFTOPS: ret['.pdf']['.ps'] = (pdf2ps, {'compress': False}) ret['.pdf;pdfa']['.ps'] = (pdf2ps, {'compress': False}) if CFG_PATH_GZIP: ret['.pdf']['.ps.gz'] = (pdf2ps, {'compress': True}) ret['.pdf;pdfa']['.ps.gz'] = (pdf2ps, {'compress': True}) if CFG_PATH_PDFTOTEXT: ret['.pdf']['.txt'] = (pdf2text, {}) ret['.pdf;pdfa']['.txt'] = (pdf2text, {}) ret['.asc']['.txt'] = (txt2text, {}) ret['.txt']['.txt'] = (txt2text, {}) ret['.csv']['.txt'] = (txt2text, {}) ret['.html']['.txt'] = (html2text, {}) ret['.htm']['.txt'] = (html2text, {}) ret['.xml']['.txt'] = (html2text, {}) if CFG_PATH_TIFF2PDF: ret['.tiff']['.pdf'] = (tiff2pdf, {}) ret['.tif']['.pdf'] = (tiff2pdf, {}) if CFG_PATH_OPENOFFICE_PYTHON and CFG_OPENOFFICE_SERVER_HOST: ret['.rtf']['.odt'] = (unoconv, {'output_format': 'odt'}) ret['.rtf']['.pdf;pdfa'] = (unoconv, {'output_format': 'pdf'}) ret['.rtf']['.txt'] = (unoconv, {'output_format': 'txt'}) ret['.rtf']['.docx'] = (unoconv, {'output_format': 'docx'}) ret['.doc']['.odt'] = (unoconv, {'output_format': 'odt'}) ret['.doc']['.pdf;pdfa'] = (unoconv, {'output_format': 'pdf'}) ret['.doc']['.txt'] = (unoconv, {'output_format': 'txt'}) ret['.doc']['.docx'] = (unoconv, {'output_format': 'docx'}) ret['.docx']['.odt'] = (unoconv, {'output_format': 'odt'}) ret['.docx']['.pdf;pdfa'] = (unoconv, {'output_format': 'pdf'}) ret['.docx']['.txt'] = (unoconv, {'output_format': 'txt'}) ret['.sxw']['.odt'] = (unoconv, {'output_format': 'odt'}) ret['.sxw']['.pdf;pdfa'] = (unoconv, {'output_format': 'pdf'}) ret['.sxw']['.txt'] = (unoconv, {'output_format': 'txt'}) ret['.docx']['.docx'] = (unoconv, {'output_format': 'docx'}) ret['.odt']['.doc'] = (unoconv, {'output_format': 'doc'}) ret['.odt']['.pdf;pdfa'] = (unoconv, {'output_format': 'pdf'}) ret['.odt']['.txt'] = (unoconv, {'output_format': 'txt'}) ret['.odt']['.docx'] = (unoconv, {'output_format': 
'docx'}) ret['.ppt']['.odp'] = (unoconv, {'output_format': 'odp'}) ret['.ppt']['.pdf;pdfa'] = (unoconv, {'output_format': 'pdf'}) ret['.ppt']['.txt'] = (unoconv, {'output_format': 'txt'}) ret['.ppt']['.pptx'] = (unoconv, {'output_format': 'pptx'}) ret['.pptx']['.odp'] = (unoconv, {'output_format': 'odp'}) ret['.pptx']['.pdf;pdfa'] = (unoconv, {'output_format': 'pdf'}) ret['.pptx']['.txt'] = (unoconv, {'output_format': 'txt'}) ret['.sxi']['.odp'] = (unoconv, {'output_format': 'odp'}) ret['.sxi']['.pdf;pdfa'] = (unoconv, {'output_format': 'pdf'}) ret['.sxi']['.txt'] = (unoconv, {'output_format': 'txt'}) ret['.sxi']['.pptx'] = (unoconv, {'output_format': 'pptx'}) ret['.odp']['.ppt'] = (unoconv, {'output_format': 'ppt'}) ret['.odp']['.pptx'] = (unoconv, {'output_format': 'pptx'}) ret['.odp']['.pdf;pdfa'] = (unoconv, {'output_format': 'pdf'}) ret['.odp']['.txt'] = (unoconv, {'output_format': 'txt'}) ret['.odp']['.pptx'] = (unoconv, {'output_format': 'pptx'}) ret['.xls']['.ods'] = (unoconv, {'output_format': 'ods'}) ret['.xls']['.xlsx'] = (unoconv, {'output_format': 'xslx'}) ret['.xlsx']['.ods'] = (unoconv, {'output_format': 'ods'}) ret['.sxc']['.ods'] = (unoconv, {'output_format': 'ods'}) ret['.sxc']['.xlsx'] = (unoconv, {'output_format': 'xslx'}) ret['.ods']['.xls'] = (unoconv, {'output_format': 'xls'}) ret['.ods']['.pdf;pdfa'] = (unoconv, {'output_format': 'pdf'}) ret['.ods']['.csv'] = (unoconv, {'output_format': 'csv'}) ret['.ods']['.xlsx'] = (unoconv, {'output_format': 'xslx'}) ret['.csv']['.txt'] = (txt2text, {}) ## Let's add all the existing output formats as potential input formats. for value in ret.values(): for key in value.keys(): if key not in ret: ret[key] = {} return ret def get_best_format_to_extract_text_from(filelist, best_formats=CFG_BIBDOCFILE_BEST_FORMATS_TO_EXTRACT_TEXT_FROM): """ Return among the filelist the best file whose format is best suited for extracting text. """ from invenio.bibdocfile import decompose_file, normalize_format best_formats = [normalize_format(aformat) for aformat in best_formats if can_convert(aformat, '.txt')] for aformat in best_formats: for filename in filelist: if decompose_file(filename, skip_version=True)[2].endswith(aformat): return filename raise InvenioWebSubmitFileConverterError("It's not possible to extract valuable text from any of the proposed files.") def get_missing_formats(filelist, desired_conversion=None): """Given a list of files it will return a dictionary of the form: file1 : missing formats to generate from it... 
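For illustration only (a sketch, not guaranteed output: the file names below are hypothetical and the exact result depends on CFG_BIBDOCFILE_DESIRED_CONVERSIONS and on which converters are installed), a call might look like:
    >>> get_missing_formats(['/tmp/test.pdf', '/tmp/test.doc'])
    {'/tmp/test.doc': ['.txt', '.pdf;pdfa'], '/tmp/test.pdf': ['.txt']}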
""" from invenio.bibdocfile import normalize_format, decompose_file def normalize_desired_conversion(): ret = {} for key, value in desired_conversion.iteritems(): ret[normalize_format(key)] = [normalize_format(aformat) for aformat in value] return ret if desired_conversion is None: desired_conversion = CFG_BIBDOCFILE_DESIRED_CONVERSIONS available_formats = [decompose_file(filename, skip_version=True)[2] for filename in filelist] missing_formats = [] desired_conversion = normalize_desired_conversion() ret = {} for filename in filelist: aformat = decompose_file(filename, skip_version=True)[2] if aformat in desired_conversion: for desired_format in desired_conversion[aformat]: if desired_format not in available_formats and desired_format not in missing_formats: missing_formats.append(desired_format) if filename not in ret: ret[filename] = [] ret[filename].append(desired_format) return ret def can_convert(input_format, output_format, max_intermediate_conversions=4): """Return the chain of conversion to transform input_format into output_format, if any.""" from invenio.bibdocfile import normalize_format if max_intermediate_conversions <= 0: return [] input_format = normalize_format(input_format) output_format = normalize_format(output_format) if input_format in __CONVERSION_MAP: if output_format in __CONVERSION_MAP[input_format]: return [__CONVERSION_MAP[input_format][output_format]] best_res = [] best_intermediate = '' for intermediate_format in __CONVERSION_MAP[input_format]: res = can_convert(intermediate_format, output_format, max_intermediate_conversions-1) if res and (len(res) < best_res or not best_res): best_res = res best_intermediate = intermediate_format if best_res: return [__CONVERSION_MAP[input_format][best_intermediate]] + best_res return [] def can_pdfopt(verbose=False): """Return True if it's possible to optimize PDFs.""" if CFG_PATH_PDFOPT: return True elif verbose: print >> sys.stderr, "PDF linearization is not supported because the pdfopt executable is not available" return False def can_pdfx(verbose=False): """Return True if it's possible to generate PDF/Xs.""" if not CFG_PATH_PDFTOPS: if verbose: print >> sys.stderr, "Conversion of PS or PDF to PDF/X is not possible because the pdftops executable is not available" return False if not CFG_PATH_GS: if verbose: print >> sys.stderr, "Conversion of PS or PDF to PDF/X is not possible because the gs executable is not available" return False else: try: output = run_shell_command("%s --version" % CFG_PATH_GS)[1].strip() if not output: raise ValueError("No version information returned") if [int(number) for number in output.split('.')] < [int(number) for number in CFG_GS_MINIMAL_VERSION_FOR_PDFX.split('.')]: print >> sys.stderr, "Conversion of PS or PDF to PDF/X is not possible because the minimal gs version for the executable %s is not met: it should be %s but %s has been found" % (CFG_PATH_GS, CFG_GS_MINIMAL_VERSION_FOR_PDFX, output) return False except Exception, err: print >> sys.stderr, "Conversion of PS or PDF to PDF/X is not possible because it's not possible to retrieve the gs version using the executable %s: %s" % (CFG_PATH_GS, err) return False if not CFG_PATH_PDFINFO: if verbose: print >> sys.stderr, "Conversion of PS or PDF to PDF/X is not possible because the pdfinfo executable is not available" return False if not os.path.exists(CFG_ICC_PATH): if verbose: print >> sys.stderr, "Conversion of PS or PDF to PDF/X is not possible because %s does not exists. Have you run make install-pdfa-helper-files?" 
% CFG_ICC_PATH return False return True def can_pdfa(verbose=False): """Return True if it's possible to generate PDF/As.""" if not CFG_PATH_PDFTOPS: if verbose: print >> sys.stderr, "Conversion of PS or PDF to PDF/A is not possible because the pdftops executable is not available" return False if not CFG_PATH_GS: if verbose: print >> sys.stderr, "Conversion of PS or PDF to PDF/A is not possible because the gs executable is not available" return False else: try: output = run_shell_command("%s --version" % CFG_PATH_GS)[1].strip() if not output: raise ValueError("No version information returned") if [int(number) for number in output.split('.')] < [int(number) for number in CFG_GS_MINIMAL_VERSION_FOR_PDFA.split('.')]: print >> sys.stderr, "Conversion of PS or PDF to PDF/A is not possible because the minimal gs version for the executable %s is not met: it should be %s but %s has been found" % (CFG_PATH_GS, CFG_GS_MINIMAL_VERSION_FOR_PDFA, output) return False except Exception, err: print >> sys.stderr, "Conversion of PS or PDF to PDF/A is not possible because it's not possible to retrieve the gs version using the executable %s: %s" % (CFG_PATH_GS, err) return False if not CFG_PATH_PDFINFO: if verbose: print >> sys.stderr, "Conversion of PS or PDF to PDF/A is not possible because the pdfinfo executable is not available" return False if not os.path.exists(CFG_ICC_PATH): if verbose: print >> sys.stderr, "Conversion of PS or PDF to PDF/A is not possible because %s does not exist. Have you run make install-pdfa-helper-files?" % CFG_ICC_PATH return False return True def can_perform_ocr(verbose=False): """Return True if it's possible to perform OCR.""" if not CFG_CAN_DO_OCR: if verbose: print >> sys.stderr, "OCR is not supported because either the pyPdf/PyPDF2 or the ReportLab Python library is missing" return False if not CFG_PATH_OCROSCRIPT: if verbose: print >> sys.stderr, "OCR is not supported because the ocroscript executable is not available" return False if not CFG_PATH_PDFTOPPM: if verbose: print >> sys.stderr, "OCR is not supported because the pdftoppm executable is not available" return False return True def guess_ocropus_produced_garbage(input_file, hocr_p): """Return True if the output produced by OCROpus in hocr format contains only garbage instead of text. This is implemented via a heuristic: if the non-letter characters outnumber the ASCII letters in the recognized words, then this is Garbage (tm). """ def _get_words_from_text(): ret = [] for row in open(input_file): for word in row.strip().split(' '): ret.append(word.strip()) return ret def _get_words_from_hocr(): ret = [] hocr = extract_hocr(open(input_file).read()) for dummy, dummy, lines in hocr: for dummy, line in lines: for word in line.split(): ret.append(word.strip()) return ret if hocr_p: words = _get_words_from_hocr() else: words = _get_words_from_text() #stats = {} #most_common_len = 0 #most_common_how_many = 0 #for word in words: #if word: #word_length = len(word.decode('utf-8')) #stats[word_length] = stats.get(word_length, 0) + 1 #if stats[word_length] > most_common_how_many: #most_common_len = word_length #most_common_how_many = stats[word_length] goods = 0 bads = 0 for word in words: for char in word.decode('utf-8'): if (u'a' <= char <= u'z') or (u'A' <= char <= u'Z'): goods += 1 else: bads += 1 if bads > goods: get_file_converter_logger().debug('OCROpus produced garbage') return True else: return False def guess_is_OCR_needed(input_file, ln='en'): """ Tries to see if enough text is retrievable from input_file. 
Return True if OCR is needed, False if it's already possible to retrieve information from the document. """ ## FIXME: a way to understand if pdftotext has returned garbage ## should be found. E.g. 1.0*len(text)/len(zlib.compress(text)) < 2.1 ## could be a good hint for garbage being found. return True def convert_file(input_file, output_file=None, output_format=None, **params): """ Convert files from one format to another. @param input_file [string] the path to an existing file @param output_file [string] the path to the desired output (if None a temporary file is generated) @param output_format [string] the desired format (if None it is taken from output_file) @param params other parameters to pass to the particular converter @return [string] the final output_file """ from invenio.bibdocfile import decompose_file, normalize_format if output_format is None: if output_file is None: raise ValueError("At least output_file or output_format should be specified.") else: output_ext = decompose_file(output_file, skip_version=True)[2] else: output_ext = normalize_format(output_format) input_ext = decompose_file(input_file, skip_version=True)[2] conversion_chain = can_convert(input_ext, output_ext) if conversion_chain: get_file_converter_logger().debug("Conversion chain from %s to %s: %s" % (input_ext, output_ext, conversion_chain)) current_input = input_file for i, (converter, final_params) in enumerate(conversion_chain): current_output = None if i == (len(conversion_chain) - 1): current_output = output_file final_params = dict(final_params) final_params.update(params) try: get_file_converter_logger().debug("Converting from %s to %s using %s with params %s" % (current_input, current_output, converter, final_params)) current_output = converter(current_input, current_output, **final_params) get_file_converter_logger().debug("... 
current_output %s" % (current_output, )) except InvenioWebSubmitFileConverterError, err: raise InvenioWebSubmitFileConverterError("Error when converting from %s to %s: %s" % (input_file, output_ext, err)) except Exception, err: register_exception(alert_admin=True) raise InvenioWebSubmitFileConverterError("Unexpected error when converting from %s to %s (%s): %s" % (input_file, output_ext, type(err), err)) if current_input != input_file: os.remove(current_input) current_input = current_output return current_output else: raise InvenioWebSubmitFileConverterError("It's impossible to convert from %s to %s" % (input_ext, output_ext)) try: _UNOCONV_DAEMON except NameError: _UNOCONV_DAEMON = None _UNOCONV_DAEMON_LOCK = threading.Lock() def _register_unoconv(): global _UNOCONV_DAEMON if CFG_OPENOFFICE_SERVER_HOST != 'localhost': return _UNOCONV_DAEMON_LOCK.acquire() try: if not _UNOCONV_DAEMON: output_log = open(CFG_UNOCONV_LOG_PATH, 'a') _UNOCONV_DAEMON = subprocess.Popen(['sudo', '-S', '-u', CFG_OPENOFFICE_USER, os.path.join(CFG_BINDIR, 'inveniounoconv'), '-vvv', '-s', CFG_OPENOFFICE_SERVER_HOST, '-p', str(CFG_OPENOFFICE_SERVER_PORT), '-l'], stdin=open('/dev/null', 'r'), stdout=output_log, stderr=output_log) time.sleep(3) finally: _UNOCONV_DAEMON_LOCK.release() def _unregister_unoconv(): global _UNOCONV_DAEMON if CFG_OPENOFFICE_SERVER_HOST != 'localhost': return _UNOCONV_DAEMON_LOCK.acquire() try: if _UNOCONV_DAEMON: output_log = open(CFG_UNOCONV_LOG_PATH, 'a') subprocess.call(['sudo', '-S', '-u', CFG_OPENOFFICE_USER, os.path.join(CFG_BINDIR, 'inveniounoconv'), '-k', '-vvv'], stdin=open('/dev/null', 'r'), stdout=output_log, stderr=output_log) time.sleep(1) if _UNOCONV_DAEMON.poll(): try: os.kill(_UNOCONV_DAEMON.pid, signal.SIGTERM) except OSError: pass if _UNOCONV_DAEMON.poll(): try: os.kill(_UNOCONV_DAEMON.pid, signal.SIGKILL) except OSError: pass finally: _UNOCONV_DAEMON_LOCK.release() ## NOTE: in case we switch back keeping LibreOffice running, uncomment ## the following line. #atexit.register(_unregister_unoconv) def unoconv(input_file, output_file=None, output_format='txt', pdfopt=True, **dummy): """Use unconv to convert among OpenOffice understood documents.""" from invenio.bibdocfile import normalize_format ## NOTE: in case we switch back keeping LibreOffice running, uncomment ## the following line. 
#_register_unoconv() input_file, output_file, dummy = prepare_io(input_file, output_file, output_format, need_working_dir=False) if output_format == 'txt': unoconv_format = 'text' else: unoconv_format = output_format try: try: ## We copy the input file and we make it available to OpenOffice ## with the user nobody from invenio.bibdocfile import decompose_file input_format = decompose_file(input_file, skip_version=True)[2] fd, tmpinputfile = tempfile.mkstemp(dir=CFG_TMPDIR, suffix=normalize_format(input_format)) os.close(fd) shutil.copy(input_file, tmpinputfile) get_file_converter_logger().debug("Prepared input file %s" % tmpinputfile) os.chmod(tmpinputfile, stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP | stat.S_IROTH) tmpoutputfile = tempfile.mktemp(dir=CFG_OPENOFFICE_TMPDIR, suffix=normalize_format(output_format)) get_file_converter_logger().debug("Prepared output file %s" % tmpoutputfile) try: execute_command(os.path.join(CFG_BINDIR, 'inveniounoconv'), '-vvv', '-s', CFG_OPENOFFICE_SERVER_HOST, '-p', str(CFG_OPENOFFICE_SERVER_PORT), '--output', tmpoutputfile, '-f', unoconv_format, tmpinputfile, sudo=CFG_OPENOFFICE_USER) except: register_exception(alert_admin=True) raise except InvenioWebSubmitFileConverterError: ## OK, maybe OpenOffice hung. Better kill it and restart it! if CFG_OPENOFFICE_SERVER_HOST != 'localhost': ## There's not much that we can do. Let's bail out. if not os.path.exists(tmpoutputfile) or not os.path.getsize(tmpoutputfile): raise else: ## Sometimes OpenOffice crashes but we don't care :-) ## it still has created a nice file. pass else: execute_command(os.path.join(CFG_BINDIR, 'inveniounoconv'), '-vvv', '-k', sudo=CFG_OPENOFFICE_USER) ## NOTE: in case we switch back keeping LibreOffice running, uncomment ## the following lines. #_unregister_unoconv() #_register_unoconv() time.sleep(5) try: execute_command(os.path.join(CFG_BINDIR, 'inveniounoconv'), '-vvv', '-s', CFG_OPENOFFICE_SERVER_HOST, '-p', str(CFG_OPENOFFICE_SERVER_PORT), '--output', tmpoutputfile, '-f', unoconv_format, tmpinputfile, sudo=CFG_OPENOFFICE_USER) except InvenioWebSubmitFileConverterError: execute_command(os.path.join(CFG_BINDIR, 'inveniounoconv'), '-vvv', '-k', sudo=CFG_OPENOFFICE_USER) if not os.path.exists(tmpoutputfile) or not os.path.getsize(tmpoutputfile): raise InvenioWebSubmitFileConverterError('No output was generated by OpenOffice') else: ## Sometimes OpenOffice crashes but we don't care :-) ## it still has created a nice file. pass except Exception, err: raise InvenioWebSubmitFileConverterError(get_unoconv_installation_guideline(err)) output_format = normalize_format(output_format) if output_format == '.pdf' and pdfopt: pdf2pdfopt(tmpoutputfile, output_file) else: shutil.copy(tmpoutputfile, output_file) execute_command(os.path.join(CFG_BINDIR, 'inveniounoconv'), '-r', tmpoutputfile, sudo=CFG_OPENOFFICE_USER) os.remove(tmpinputfile) return output_file def get_unoconv_installation_guideline(err): """Return the Libre/OpenOffice installation guideline (embedding the current error message). """ from invenio.bibtask import guess_apache_process_user return wrap_text_in_a_box("""\ OpenOffice.org can't properly create files in the OpenOffice.org temporary directory %(tmpdir)s, as the user %(nobody)s (as configured in the CFG_OPENOFFICE_USER invenio(-local).conf variable): %(err)s. 
In your /etc/sudoers file, you should authorize the %(apache)s user to run %(unoconv)s as %(nobody)s user as in: %(apache)s ALL=(%(nobody)s) NOPASSWD: %(unoconv)s You should then run the following commands: $ sudo mkdir -p %(tmpdir)s $ sudo chown -R %(nobody)s %(tmpdir)s $ sudo chmod -R 755 %(tmpdir)s""" % { 'tmpdir' : CFG_OPENOFFICE_TMPDIR, 'nobody' : CFG_OPENOFFICE_USER, 'err' : err, 'apache' : CFG_BIBSCHED_PROCESS_USER or guess_apache_process_user(), 'python' : CFG_PATH_OPENOFFICE_PYTHON, 'unoconv' : os.path.join(CFG_BINDIR, 'inveniounoconv') }) def can_unoconv(verbose=False): """ If OpenOffice.org integration is enabled, checks whether the system is properly configured. """ if CFG_PATH_OPENOFFICE_PYTHON and CFG_OPENOFFICE_SERVER_HOST: try: test = os.path.join(CFG_TMPDIR, 'test.txt') open(test, 'w').write('test') output = unoconv(test, output_format='pdf') output2 = convert_file(output, output_format='.txt') if 'test' not in open(output2).read(): raise Exception("Coulnd't produce a valid PDF with Libre/OpenOffice.org") os.remove(output2) os.remove(output) os.remove(test) return True except Exception, err: if verbose: print >> sys.stderr, get_unoconv_installation_guideline(err) return False else: if verbose: print >> sys.stderr, "Libre/OpenOffice.org integration not enabled" return False def any2djvu(input_file, output_file=None, resolution=400, ocr=True, input_format=5, **dummy): """ Transform input_file into a .djvu file. @param input_file [string] the input file name @param output_file [string] the output_file file name, None for temporary generated @param resolution [int] the resolution of the output_file @param input_format [int] [1-9]: 1 - DjVu Document (for verification or OCR) 2 - PS/PS.GZ/PDF Document (default) 3 - Photo/Picture/Icon 4 - Scanned Document - B&W - <200 dpi 5 - Scanned Document - B&W - 200-400 dpi 6 - Scanned Document - B&W - >400 dpi 7 - Scanned Document - Color/Mixed - <200 dpi 8 - Scanned Document - Color/Mixed - 200-400 dpi 9 - Scanned Document - Color/Mixed - >400 dpi @return [string] output_file input_file. raise InvenioWebSubmitFileConverterError in case of errors. Note: due to the bottleneck of using a centralized server, it is very slow and is not suitable for interactive usage (e.g. WebSubmit functions) """ from invenio.bibdocfile import decompose_file input_file, output_file, working_dir = prepare_io(input_file, output_file, '.djvu') ocr = ocr and "1" or "0" ## Any2djvu expect to find the file in the current directory. execute_command(CFG_PATH_ANY2DJVU, '-a', '-c', '-r', resolution, '-o', ocr, '-f', input_format, os.path.basename(input_file), cwd=working_dir) ## Any2djvu doesn't let you choose the output_file file name. djvu_output = os.path.join(working_dir, decompose_file(input_file)[1] + '.djvu') shutil.move(djvu_output, output_file) clean_working_dir(working_dir) return output_file _RE_FIND_TITLE = re.compile(r'^Title:\s*(.*?)\s*$') def pdf2pdfx(input_file, output_file=None, title=None, pdfopt=False, profile="pdf/x-3:2002", **dummy): """ Transform any PDF into a PDF/X (see: ) @param input_file [string] the input file name @param output_file [string] the output_file file name, None for temporary generated @param title [string] the title of the document. None for autodiscovery. @param pdfopt [bool] whether to linearize the pdf, too. @param profile: [string] the PDFX profile to use. Supports: 'pdf/x-1a:2001', 'pdf/x-1a:2003', 'pdf/x-3:2002' @return [string] output_file input_file raise InvenioWebSubmitFileConverterError in case of errors. 
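Illustrative usage (a sketch only: the paths are hypothetical and a Ghostscript new enough for PDF/X is assumed, see can_pdfx()):
    >>> pdf2pdfx('/tmp/paper.pdf', output_file='/tmp/paper_pdfx.pdf', profile='pdf/x-3:2002')  # returns the path of the generated PDF/X file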
""" input_file, output_file, working_dir = prepare_io(input_file, output_file, '.pdf') if title is None: stdout = execute_command(CFG_PATH_PDFINFO, input_file) for line in stdout.split('\n'): g = _RE_FIND_TITLE.match(line) if g: title = g.group(1) break if not title: title = 'No title' get_file_converter_logger().debug("Extracted title is %s" % title) if os.path.exists(CFG_ICC_PATH): shutil.copy(CFG_ICC_PATH, working_dir) else: raise InvenioWebSubmitFileConverterError('ERROR: ISOCoatedsb.icc file missing. Have you run "make install-pdfa-helper-files" as part of your Invenio deployment?') pdfx_header = open(CFG_PDFX_DEF_PATH).read() pdfx_header = pdfx_header.replace('<<<>>>', title) icc_iso_profile_def = '' if profile == 'pdf/x-1a:2001': pdfx_version = 'PDF/X-1a:2001' pdfx_conformance = 'PDF/X-1a:2001' elif profile == 'pdf/x-1a:2003': pdfx_version = 'PDF/X-1a:2003' pdfx_conformance = 'PDF/X-1a:2003' elif profile == 'pdf/x-3:2002': icc_iso_profile_def = '/ICCProfile (ISOCoatedsb.icc)' pdfx_version = 'PDF/X-3:2002' pdfx_conformance = 'PDF/X-3:2002' pdfx_header = pdfx_header.replace('<<<>>>', icc_iso_profile_def) pdfx_header = pdfx_header.replace('<<<>>>', pdfx_version) pdfx_header = pdfx_header.replace('<<<>>>', pdfx_conformance) outputpdf = os.path.join(working_dir, 'output_file.pdf') open(os.path.join(working_dir, 'PDFX_def.ps'), 'w').write(pdfx_header) if profile in ['pdf/x-3:2002']: execute_command(CFG_PATH_GS, '-sProcessColorModel=DeviceCMYK', '-dPDFX', '-dBATCH', '-dNOPAUSE', '-dNOOUTERSAVE', '-dUseCIEColor', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sOutputFile=output_file.pdf', os.path.join(working_dir, 'PDFX_def.ps'), input_file, cwd=working_dir) elif profile in ['pdf/x-1a:2001', 'pdf/x-1a:2003']: execute_command(CFG_PATH_GS, '-sProcessColorModel=DeviceCMYK', '-dPDFX', '-dBATCH', '-dNOPAUSE', '-dNOOUTERSAVE', '-sColorConversionStrategy=CMYK', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sOutputFile=output_file.pdf', os.path.join(working_dir, 'PDFX_def.ps'), input_file, cwd=working_dir) if pdfopt: execute_command(CFG_PATH_PDFOPT, outputpdf, output_file) else: shutil.move(outputpdf, output_file) clean_working_dir(working_dir) return output_file def pdf2pdfa(input_file, output_file=None, title=None, pdfopt=True, **dummy): """ Transform any PDF into a PDF/A (see: ) @param input_file [string] the input file name @param output_file [string] the output_file file name, None for temporary generated @param title [string] the title of the document. None for autodiscovery. @param pdfopt [bool] whether to linearize the pdf, too. @return [string] output_file input_file raise InvenioWebSubmitFileConverterError in case of errors. """ input_file, output_file, working_dir = prepare_io(input_file, output_file, '.pdf') if title is None: stdout = execute_command(CFG_PATH_PDFINFO, input_file) for line in stdout.split('\n'): g = _RE_FIND_TITLE.match(line) if g: title = g.group(1) break if not title: title = 'No title' get_file_converter_logger().debug("Extracted title is %s" % title) if os.path.exists(CFG_ICC_PATH): shutil.copy(CFG_ICC_PATH, working_dir) else: raise InvenioWebSubmitFileConverterError('ERROR: ISOCoatedsb.icc file missing. 
Have you run "make install-pdfa-helper-files" as part of your Invenio deployment?') pdfa_header = open(CFG_PDFA_DEF_PATH).read() pdfa_header = pdfa_header.replace('<<<>>>', title) inputps = os.path.join(working_dir, 'input.ps') outputpdf = os.path.join(working_dir, 'output_file.pdf') open(os.path.join(working_dir, 'PDFA_def.ps'), 'w').write(pdfa_header) execute_command(CFG_PATH_PDFTOPS, '-level3', input_file, inputps) execute_command(CFG_PATH_GS, '-sProcessColorModel=DeviceCMYK', '-dPDFA', '-dBATCH', '-dNOPAUSE', '-dNOOUTERSAVE', '-dUseCIEColor', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sOutputFile=output_file.pdf', os.path.join(working_dir, 'PDFA_def.ps'), 'input.ps', cwd=working_dir) if pdfopt: execute_command(CFG_PATH_PDFOPT, outputpdf, output_file) else: shutil.move(outputpdf, output_file) clean_working_dir(working_dir) return output_file def pdf2pdfopt(input_file, output_file=None, **dummy): """ Linearize the input PDF in order to improve the web-experience when visualizing the document through the web. @param input_file [string] the input input_file @param output_file [string] the output_file file name, None for temporary generated @return [string] output_file input_file raise InvenioWebSubmitFileConverterError in case of errors. """ input_file, output_file, dummy = prepare_io(input_file, output_file, '.pdf', need_working_dir=False) execute_command(CFG_PATH_PDFOPT, input_file, output_file) return output_file def pdf2ps(input_file, output_file=None, level=2, compress=True, **dummy): """ Convert from Pdf to Postscript. """ if compress: suffix = '.ps.gz' else: suffix = '.ps' input_file, output_file, working_dir = prepare_io(input_file, output_file, suffix) execute_command(CFG_PATH_PDFTOPS, '-level%i' % level, input_file, os.path.join(working_dir, 'output.ps')) if compress: execute_command(CFG_PATH_GZIP, '-c', os.path.join(working_dir, 'output.ps'), filename_out=output_file) else: shutil.move(os.path.join(working_dir, 'output.ps'), output_file) clean_working_dir(working_dir) return output_file def ps2pdfx(input_file, output_file=None, title=None, pdfopt=False, profile="pdf/x-3:2002", **dummy): """ Transform any PS into a PDF/X (see: ) @param input_file [string] the input file name @param output_file [string] the output_file file name, None for temporary generated @param title [string] the title of the document. None for autodiscovery. @param pdfopt [bool] whether to linearize the pdf, too. @param profile: [string] the PDFX profile to use. Supports: 'pdf/x-1a:2001', 'pdf/x-1a:2003', 'pdf/x-3:2002' @return [string] output_file input_file raise InvenioWebSubmitFileConverterError in case of errors. 
""" input_file, output_file, working_dir = prepare_io(input_file, output_file, '.pdf') if input_file.endswith('.gz'): new_input_file = os.path.join(working_dir, 'input.ps') execute_command(CFG_PATH_GUNZIP, '-c', input_file, filename_out=new_input_file) input_file = new_input_file if not title: title = 'No title' shutil.copy(CFG_ICC_PATH, working_dir) pdfx_header = open(CFG_PDFX_DEF_PATH).read() pdfx_header = pdfx_header.replace('<<<>>>', title) icc_iso_profile_def = '' if profile == 'pdf/x-1a:2001': pdfx_version = 'PDF/X-1a:2001' pdfx_conformance = 'PDF/X-1a:2001' elif profile == 'pdf/x-1a:2003': pdfx_version = 'PDF/X-1a:2003' pdfx_conformance = 'PDF/X-1a:2003' elif profile == 'pdf/x-3:2002': icc_iso_profile_def = '/ICCProfile (ISOCoatedsb.icc)' pdfx_version = 'PDF/X-3:2002' pdfx_conformance = 'PDF/X-3:2002' pdfx_header = pdfx_header.replace('<<<>>>', icc_iso_profile_def) pdfx_header = pdfx_header.replace('<<<>>>', pdfx_version) pdfx_header = pdfx_header.replace('<<<>>>', title) outputpdf = os.path.join(working_dir, 'output_file.pdf') open(os.path.join(working_dir, 'PDFX_def.ps'), 'w').write(pdfx_header) if profile in ['pdf/x-3:2002']: execute_command(CFG_PATH_GS, '-sProcessColorModel=DeviceCMYK', '-dPDFX', '-dBATCH', '-dNOPAUSE', '-dNOOUTERSAVE', '-dUseCIEColor', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sOutputFile=output_file.pdf', os.path.join(working_dir, 'PDFX_def.ps'), 'input.ps', cwd=working_dir) elif profile in ['pdf/x-1a:2001', 'pdf/x-1a:2003']: execute_command(CFG_PATH_GS, '-sProcessColorModel=DeviceCMYK', '-dPDFX', '-dBATCH', '-dNOPAUSE', '-dNOOUTERSAVE', '-sColorConversionStrategy=CMYK', '-dAutoRotatePages=/None', '-sDEVICE=pdfwrite', '-sOutputFile=output_file.pdf', os.path.join(working_dir, 'PDFX_def.ps'), 'input.ps', cwd=working_dir) if pdfopt: execute_command(CFG_PATH_PDFOPT, outputpdf, output_file) else: shutil.move(outputpdf, output_file) clean_working_dir(working_dir) return output_file def ps2pdfa(input_file, output_file=None, title=None, pdfopt=True, **dummy): """ Transform any PS into a PDF/A (see: ) @param input_file [string] the input file name @param output_file [string] the output_file file name, None for temporary generated @param title [string] the title of the document. None for autodiscovery. @param pdfopt [bool] whether to linearize the pdf, too. @return [string] output_file input_file raise InvenioWebSubmitFileConverterError in case of errors. 
""" input_file, output_file, working_dir = prepare_io(input_file, output_file, '.pdf') if input_file.endswith('.gz'): new_input_file = os.path.join(working_dir, 'input.ps') execute_command(CFG_PATH_GUNZIP, '-c', input_file, filename_out=new_input_file) input_file = new_input_file if not title: title = 'No title' shutil.copy(CFG_ICC_PATH, working_dir) pdfa_header = open(CFG_PDFA_DEF_PATH).read() pdfa_header = pdfa_header.replace('<<<>>>', title) outputpdf = os.path.join(working_dir, 'output_file.pdf') open(os.path.join(working_dir, 'PDFA_def.ps'), 'w').write(pdfa_header) execute_command(CFG_PATH_GS, '-sProcessColorModel=DeviceCMYK', '-dPDFA', '-dBATCH', '-dNOPAUSE', '-dNOOUTERSAVE', '-dUseCIEColor', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sOutputFile=output_file.pdf', os.path.join(working_dir, 'PDFA_def.ps'), input_file, cwd=working_dir) if pdfopt: execute_command(CFG_PATH_PDFOPT, outputpdf, output_file) else: shutil.move(outputpdf, output_file) clean_working_dir(working_dir) return output_file def ps2pdf(input_file, output_file=None, pdfopt=True, **dummy): """ Transform any PS into a PDF @param input_file [string] the input file name @param output_file [string] the output_file file name, None for temporary generated @param pdfopt [bool] whether to linearize the pdf, too. @return [string] output_file input_file raise InvenioWebSubmitFileConverterError in case of errors. """ input_file, output_file, working_dir = prepare_io(input_file, output_file, '.pdf') if input_file.endswith('.gz'): new_input_file = os.path.join(working_dir, 'input.ps') execute_command(CFG_PATH_GUNZIP, '-c', input_file, filename_out=new_input_file) input_file = new_input_file outputpdf = os.path.join(working_dir, 'output_file.pdf') execute_command(CFG_PATH_PS2PDF, input_file, outputpdf, cwd=working_dir) if pdfopt: execute_command(CFG_PATH_PDFOPT, outputpdf, output_file) else: shutil.move(outputpdf, output_file) clean_working_dir(working_dir) return output_file def pdf2pdfhocr(input_pdf, text_hocr, output_pdf, rotations=None, font='Courier', draft=False): """ Adds the OCRed text to the original pdf. @param rotations: a list of angles by which pages should be rotated """ def _get_page_rotation(i): if len(rotations) > i: return rotations[i] return 0 if rotations is None: rotations = [] input_pdf, hocr_pdf, dummy = prepare_io(input_pdf, output_ext='.pdf', need_working_dir=False) create_pdf(extract_hocr(open(text_hocr).read()), hocr_pdf, font, draft) input1 = PdfFileReader(file(input_pdf, "rb")) input2 = PdfFileReader(file(hocr_pdf, "rb")) output = PdfFileWriter() info = input1.getDocumentInfo() if info: infoDict = output._info.getObject() infoDict.update(info) for i in range(0, input1.getNumPages()): orig_page = input1.getPage(i) text_page = input2.getPage(i) angle = _get_page_rotation(i) if angle != 0: print >> sys.stderr, "Rotating page %d by %d degrees." % (i, angle) text_page = text_page.rotateClockwise(angle) if draft: below, above = orig_page, text_page else: below, above = text_page, orig_page below.mergePage(above) if angle != 0 and not draft: print >> sys.stderr, "Rotating back page %d by %d degrees." % (i, angle) below.rotateCounterClockwise(angle) output.addPage(below) outputStream = file(output_pdf, "wb") output.write(outputStream) outputStream.close() os.remove(hocr_pdf) return output_pdf def pdf2hocr2pdf(input_file, output_file=None, ln='en', return_working_dir=False, extract_only_text=False, pdfopt=True, font='Courier', draft=False, **dummy): """ Return the text content in input_file. 
    @param ln is a two letter language code to give the OCR tool a hint.
    @param return_working_dir if set to True, will return the output_file path
        and the working_dir path, instead of deleting the working_dir. This is
        useful in case you need the intermediate images to build a PDF again.
    """
    def _perform_rotate(working_dir, imagefile, angle):
        """Rotate imagefile by the given angle. Creates rotated.ppm."""
        get_file_converter_logger().debug('Performing rotate on %s by %s degrees' % (imagefile, angle))
        if not angle:
            #execute_command('%s %s %s', CFG_PATH_CONVERT, os.path.join(working_dir, imagefile), os.path.join(working_dir, 'rotated-%s' % imagefile))
            shutil.copy(os.path.join(working_dir, imagefile), os.path.join(working_dir, 'rotated.ppm'))
        else:
            execute_command(CFG_PATH_CONVERT, os.path.join(working_dir, imagefile), '-rotate', str(angle), '-depth', str(8), os.path.join(working_dir, 'rotated.ppm'))
        return True

    def _perform_deskew(working_dir):
        """Perform ocroscript deskew. Expects to work on rotated.ppm.
        Creates deskewed.ppm. Return True if deskewing was fine."""
        get_file_converter_logger().debug('Performing deskew')
        try:
            dummy, stderr = execute_command_with_stderr(CFG_PATH_OCROSCRIPT, os.path.join(CFG_ETCDIR, 'websubmit', 'file_converter_templates', 'deskew.lua'), os.path.join(working_dir, 'rotated.ppm'), os.path.join(working_dir, 'deskewed.ppm'))
            if stderr.strip():
                get_file_converter_logger().debug('Errors found during deskewing')
                return False
            else:
                return True
        except InvenioWebSubmitFileConverterError, err:
            get_file_converter_logger().debug('Deskewing error: %s' % err)
            return False

    def _perform_recognize(working_dir):
        """Perform ocroscript recognize. Expects to work on deskewed.ppm.
        Creates recognize.out. Return True if recognizing was fine."""
        get_file_converter_logger().debug('Performing recognize')
        if extract_only_text:
            output_mode = 'text'
        else:
            output_mode = 'hocr'
        try:
            dummy, stderr = execute_command_with_stderr(CFG_PATH_OCROSCRIPT, 'recognize', '--tesslanguage=%s' % ln, '--output-mode=%s' % output_mode, os.path.join(working_dir, 'deskewed.ppm'), filename_out=os.path.join(working_dir, 'recognize.out'))
            if stderr.strip():
                ## There was some output on stderr
                get_file_converter_logger().debug('Errors found in recognize.err')
                return False
            return not guess_ocropus_produced_garbage(os.path.join(working_dir, 'recognize.out'), not extract_only_text)
        except InvenioWebSubmitFileConverterError, err:
            get_file_converter_logger().debug('Recognizer error: %s' % err)
            return False

    def _perform_dummy_recognize(working_dir):
        """Return an empty text or an empty hocr referencing the image."""
        get_file_converter_logger().debug('Performing dummy recognize')
        if extract_only_text:
            out = ''
        else:
            out = """ OCR Output
""" open(os.path.join(working_dir, 'recognize.out'), 'w').write(out) def _find_image_file(working_dir, imageprefix, page): ret = '%s-%d.ppm' % (imageprefix, page) if os.path.exists(os.path.join(working_dir, ret)): return ret ret = '%s-%02d.ppm' % (imageprefix, page) if os.path.exists(os.path.join(working_dir, ret)): return ret ret = '%s-%03d.ppm' % (imageprefix, page) if os.path.exists(os.path.join(working_dir, ret)): return ret ret = '%s-%04d.ppm' % (imageprefix, page) if os.path.exists(os.path.join(working_dir, ret)): return ret ret = '%s-%05d.ppm' % (imageprefix, page) if os.path.exists(os.path.join(working_dir, ret)): return ret ret = '%s-%06d.ppm' % (imageprefix, page) if os.path.exists(os.path.join(working_dir, ret)): return ret ## I guess we won't have documents with more than million pages return None def _ocr(tmp_output_file): """ Append to tmp_output_file the partial results of OCROpus recognize. Return a list of rotations. """ page = 0 rotations = [] while True: page += 1 get_file_converter_logger().debug('Page %d.' % page) execute_command(CFG_PATH_PDFTOPPM, '-f', str(page), '-l', str(page), '-r', str(CFG_PPM_RESOLUTION), '-aa', 'yes', '-freetype', 'yes', input_file, os.path.join(working_dir, 'image')) imagefile = _find_image_file(working_dir, 'image', page) if imagefile == None: break for angle in (0, 180, 90, 270): get_file_converter_logger().debug('Trying %d degrees...' % angle) if _perform_rotate(working_dir, imagefile, angle) and _perform_deskew(working_dir) and _perform_recognize(working_dir): rotations.append(angle) break else: get_file_converter_logger().debug('Dummy recognize') rotations.append(0) _perform_dummy_recognize(working_dir) open(tmp_output_file, 'a').write(open(os.path.join(working_dir, 'recognize.out')).read()) # clean os.remove(os.path.join(working_dir, imagefile)) return rotations if CFG_PATH_OCROSCRIPT: if len(ln) == 2: ln = CFG_TWO2THREE_LANG_CODES.get(ln, 'eng') if extract_only_text: input_file, output_file, working_dir = prepare_io(input_file, output_file, output_ext='.txt') _ocr(output_file) else: input_file, tmp_output_hocr, working_dir = prepare_io(input_file, output_ext='.hocr') rotations = _ocr(tmp_output_hocr) if pdfopt: input_file, tmp_output_pdf, dummy = prepare_io(input_file, output_ext='.pdf', need_working_dir=False) tmp_output_pdf, output_file, dummy = prepare_io(tmp_output_pdf, output_file, output_ext='.pdf', need_working_dir=False) pdf2pdfhocr(input_file, tmp_output_hocr, tmp_output_pdf, rotations=rotations, font=font, draft=draft) pdf2pdfopt(tmp_output_pdf, output_file) os.remove(tmp_output_pdf) else: input_file, output_file, dummy = prepare_io(input_file, output_file, output_ext='.pdf', need_working_dir=False) pdf2pdfhocr(input_file, tmp_output_hocr, output_file, rotations=rotations, font=font, draft=draft) clean_working_dir(working_dir) return output_file else: raise InvenioWebSubmitFileConverterError("It's impossible to generate HOCR output from PDF. OCROpus is not available.") def pdf2text(input_file, output_file=None, perform_ocr=True, ln='en', **dummy): """ Return the text content in input_file. 
""" input_file, output_file, dummy = prepare_io(input_file, output_file, '.txt', need_working_dir=False) execute_command(CFG_PATH_PDFTOTEXT, '-enc', 'UTF-8', '-eol', 'unix', '-nopgbrk', input_file, output_file) if perform_ocr and can_perform_ocr(): ocred_output = pdf2hocr2pdf(input_file, ln=ln, extract_only_text=True) try: output = open(output_file, 'a') for row in open(ocred_output): output.write(row) output.close() finally: silent_remove(ocred_output) return output_file def txt2text(input_file, output_file=None, **dummy): """ Return the text content in input_file """ input_file, output_file, dummy = prepare_io(input_file, output_file, '.txt', need_working_dir=False) shutil.copy(input_file, output_file) return output_file def html2text(input_file, output_file=None, **dummy): """ Return the text content of an HTML/XML file. """ class HTMLStripper(HTMLParser.HTMLParser): def __init__(self, output_file): HTMLParser.HTMLParser.__init__(self) self.output_file = output_file def handle_entityref(self, name): if name in entitydefs: self.output_file.write(entitydefs[name].decode('latin1').encode('utf8')) def handle_data(self, data): if data.strip(): self.output_file.write(_RE_CLEAN_SPACES.sub(' ', data)) def handle_charref(self, data): try: self.output_file.write(unichr(int(data)).encode('utf8')) except: pass def close(self): self.output_file.close() HTMLParser.HTMLParser.close(self) input_file, output_file, dummy = prepare_io(input_file, output_file, '.txt', need_working_dir=False) html_stripper = HTMLStripper(open(output_file, 'w')) for line in open(input_file): html_stripper.feed(line) html_stripper.close() return output_file def djvu2text(input_file, output_file=None, **dummy): """ Return the text content in input_file. """ input_file, output_file, dummy = prepare_io(input_file, output_file, '.txt', need_working_dir=False) execute_command(CFG_PATH_DJVUTXT, input_file, output_file) return output_file def djvu2ps(input_file, output_file=None, level=2, compress=True, **dummy): """ Convert a djvu into a .ps[.gz] """ if compress: input_file, output_file, working_dir = prepare_io(input_file, output_file, output_ext='.ps.gz') try: execute_command(CFG_PATH_DJVUPS, input_file, os.path.join(working_dir, 'output.ps')) execute_command(CFG_PATH_GZIP, '-c', os.path.join(working_dir, 'output.ps'), filename_out=output_file) finally: clean_working_dir(working_dir) else: try: input_file, output_file, working_dir = prepare_io(input_file, output_file, output_ext='.ps') execute_command(CFG_PATH_DJVUPS, '-level=%i' % level, input_file, output_file) finally: clean_working_dir(working_dir) return output_file def tiff2pdf(input_file, output_file=None, pdfopt=True, pdfa=True, perform_ocr=True, **args): """ Convert a .tiff into a .pdf """ if pdfa or pdfopt or perform_ocr: input_file, output_file, working_dir = prepare_io(input_file, output_file, '.pdf') try: partial_output = os.path.join(working_dir, 'output.pdf') execute_command(CFG_PATH_TIFF2PDF, '-o', partial_output, input_file) if perform_ocr: pdf2hocr2pdf(partial_output, output_file, pdfopt=pdfopt, **args) elif pdfa: pdf2pdfa(partial_output, output_file, pdfopt=pdfopt, **args) else: pdfopt(partial_output, output_file) finally: clean_working_dir(working_dir) else: input_file, output_file, dummy = prepare_io(input_file, output_file, '.pdf', need_working_dir=False) execute_command(CFG_PATH_TIFF2PDF, '-o', output_file, input_file) return output_file def pstotext(input_file, output_file=None, **dummy): """ Convert a .ps[.gz] into text. 
""" input_file, output_file, working_dir = prepare_io(input_file, output_file, '.txt') try: if input_file.endswith('.gz'): new_input_file = os.path.join(working_dir, 'input.ps') execute_command(CFG_PATH_GUNZIP, '-c', input_file, filename_out=new_input_file) input_file = new_input_file execute_command(CFG_PATH_PSTOTEXT, '-output', output_file, input_file) finally: clean_working_dir(working_dir) return output_file def gzip(input_file, output_file=None, **dummy): """ Compress a file. """ input_file, output_file, dummy = prepare_io(input_file, output_file, '.gz', need_working_dir=False) execute_command(CFG_PATH_GZIP, '-c', input_file, filename_out=output_file) return output_file def gunzip(input_file, output_file=None, **dummy): """ Uncompress a file. """ from invenio.bibdocfile import decompose_file input_ext = decompose_file(input_file, skip_version=True)[2] if input_ext.endswith('.gz'): input_ext = input_ext[:-len('.gz')] else: input_ext = None input_file, output_file, dummy = prepare_io(input_file, output_file, input_ext, need_working_dir=False) execute_command(CFG_PATH_GUNZIP, '-c', input_file, filename_out=output_file) return output_file def prepare_io(input_file, output_file=None, output_ext=None, need_working_dir=True): """Clean input_file and the output_file.""" from invenio.bibdocfile import decompose_file, normalize_format output_ext = normalize_format(output_ext) get_file_converter_logger().debug('Preparing IO for input=%s, output=%s, output_ext=%s' % (input_file, output_file, output_ext)) if output_ext is None: if output_file is None: output_ext = '.tmp' else: output_ext = decompose_file(output_file, skip_version=True)[2] if output_file is None: try: (fd, output_file) = tempfile.mkstemp(suffix=output_ext, dir=CFG_TMPDIR) os.close(fd) except IOError, err: raise InvenioWebSubmitFileConverterError("It's impossible to create a temporary file: %s" % err) else: output_file = os.path.abspath(output_file) if os.path.exists(output_file): os.remove(output_file) if need_working_dir: try: working_dir = tempfile.mkdtemp(dir=CFG_TMPDIR, prefix='conversion') except IOError, err: raise InvenioWebSubmitFileConverterError("It's impossible to create a temporary directory: %s" % err) input_ext = decompose_file(input_file, skip_version=True)[2] new_input_file = os.path.join(working_dir, 'input' + input_ext) shutil.copy(input_file, new_input_file) input_file = new_input_file else: working_dir = None input_file = os.path.abspath(input_file) get_file_converter_logger().debug('IO prepared: input_file=%s, output_file=%s, working_dir=%s' % (input_file, output_file, working_dir)) return (input_file, output_file, working_dir) def clean_working_dir(working_dir): """ Remove the working_dir. 
""" get_file_converter_logger().debug('Cleaning working_dir: %s' % working_dir) shutil.rmtree(working_dir) def execute_command(*args, **argd): """Wrapper to run_process_with_timeout.""" get_file_converter_logger().debug("Executing: %s" % (args, )) args = [str(arg) for arg in args] res, stdout, stderr = run_process_with_timeout(args, cwd=argd.get('cwd'), filename_out=argd.get('filename_out'), filename_err=argd.get('filename_err'), sudo=argd.get('sudo')) get_file_converter_logger().debug('res: %s, stdout: %s, stderr: %s' % (res, stdout, stderr)) if res != 0: message = "ERROR: Error in running %s\n stdout:\n%s\nstderr:\n%s\n" % (args, stdout, stderr) get_file_converter_logger().error(message) raise InvenioWebSubmitFileConverterError(message) return stdout def execute_command_with_stderr(*args, **argd): """Wrapper to run_process_with_timeout.""" get_file_converter_logger().debug("Executing: %s" % (args, )) res, stdout, stderr = run_process_with_timeout(args, cwd=argd.get('cwd'), filename_out=argd.get('filename_out'), sudo=argd.get('sudo')) if res != 0: message = "ERROR: Error in running %s\n stdout:\n%s\nstderr:\n%s\n" % (args, stdout, stderr) get_file_converter_logger().error(message) raise InvenioWebSubmitFileConverterError(message) return stdout, stderr def silent_remove(path): """Remove without errors a path.""" if os.path.exists(path): try: os.remove(path) except OSError: pass __CONVERSION_MAP = get_conversion_map() def main_cli(): """ main function when the library behaves as a normal CLI tool. """ from invenio.bibdocfile import normalize_format parser = OptionParser() parser.add_option("-c", "--convert", dest="input_name", help="convert the specified FILE", metavar="FILE") parser.add_option("-d", "--debug", dest="debug", action="store_true", help="Enable debug information") parser.add_option("--special-pdf2hocr2pdf", dest="ocrize", help="convert the given scanned PDF into a PDF with OCRed text", metavar="FILE") parser.add_option("-f", "--format", dest="output_format", help="the desired output format", metavar="FORMAT") parser.add_option("-o", "--output", dest="output_name", help="the desired output FILE (if not specified a new file will be generated with the desired output format)") parser.add_option("--without-pdfa", action="store_false", dest="pdf_a", default=True, help="don't force creation of PDF/A PDFs") parser.add_option("--without-pdfopt", action="store_false", dest="pdfopt", default=True, help="don't force optimization of PDFs files") parser.add_option("--without-ocr", action="store_false", dest="ocr", default=True, help="don't force OCR") parser.add_option("--can-convert", dest="can_convert", help="display all the possible format that is possible to generate from the given format", metavar="FORMAT") parser.add_option("--is-ocr-needed", dest="check_ocr_is_needed", help="check if OCR is needed for the FILE specified", metavar="FILE") parser.add_option("-t", "--title", dest="title", help="specify the title (used when creating PDFs)", metavar="TITLE") parser.add_option("-l", "--language", dest="ln", help="specify the language (used when performing OCR, e.g. 
en, it, fr...)", metavar="LN", default='en') (options, dummy) = parser.parse_args() if options.debug: from logging import basicConfig basicConfig() get_file_converter_logger().setLevel(DEBUG) if options.can_convert: if options.can_convert: input_format = normalize_format(options.can_convert) if input_format == '.pdf': if can_pdfopt(True): print "PDF linearization supported" else: print "No PDF linearization support" if can_pdfa(True): print "PDF/A generation supported" else: print "No PDF/A generation support" if can_perform_ocr(True): print "OCR supported" else: print "OCR not supported" print 'Can convert from "%s" to:' % input_format[1:], for output_format in __CONVERSION_MAP: if can_convert(input_format, output_format): print '"%s"' % output_format[1:], print elif options.check_ocr_is_needed: print "Checking if OCR is needed on %s..." % options.check_ocr_is_needed, sys.stdout.flush() if guess_is_OCR_needed(options.check_ocr_is_needed): print "needed." else: print "not needed." elif options.ocrize: try: output = pdf2hocr2pdf(options.ocrize, output_file=options.output_name, title=options.title, ln=options.ln) print "Output stored in %s" % output except InvenioWebSubmitFileConverterError, err: print "ERROR: %s" % err sys.exit(1) else: try: if not options.output_name and not options.output_format: parser.error("Either --format, --output should be specified") if not options.input_name: parser.error("An input should be specified!") output = convert_file(options.input_name, output_file=options.output_name, output_format=options.output_format, pdfopt=options.pdfopt, pdfa=options.pdf_a, title=options.title, ln=options.ln) print "Output stored in %s" % output except InvenioWebSubmitFileConverterError, err: print "ERROR: %s" % err sys.exit(1) if __name__ == "__main__": main_cli()