diff --git a/INSTALL b/INSTALL
index ca90fedb3..d106a8a5b 100644
--- a/INSTALL
+++ b/INSTALL
@@ -1,864 +1,863 @@
Invenio INSTALLATION
====================

About
=====

This document specifies how to build, customize, and install
Invenio v1.1.4 for the first time. See RELEASE-NOTES if you are
upgrading from a previous Invenio release.

Contents
========

0. Prerequisites
1. Quick instructions for the impatient Invenio admin
2. Detailed instructions for the patient Invenio admin

0. Prerequisites
================

Here is the software you need to have around before you start
installing Invenio:

 a) Unix-like operating system. The main development and production
    platforms for Invenio at CERN are the GNU/Linux distributions
    Debian, Gentoo, Scientific Linux (aka RHEL), and Ubuntu, but we
    also develop on Mac OS X. Basically any Unix system supporting
    the software listed below should do.

    If you are using Debian GNU/Linux ``Lenny'' or later, then you
    can install most of the below-mentioned prerequisites and
    recommendations by running:

      $ sudo aptitude install python-dev apache2-mpm-prefork \
           mysql-server mysql-client python-mysqldb \
           python-4suite-xml python-simplejson python-xml \
           gnuplot poppler-utils \
           gs-common clisp gettext libapache2-mod-wsgi unzip \
           python-dateutil python-rdflib python-pyparsing \
           python-gnuplot python-magic pdftk html2text giflib-tools \
           pstotext netpbm python-pypdf python-chardet python-lxml \
           python-unidecode redis-server python-redis

    You may also want to install some of the following packages, if
    they are available on your particular platform:

      $ sudo aptitude install sbcl cmucl pylint pychecker pyflakes \
           python-profiler python-epydoc libapache2-mod-xsendfile \
           openoffice.org python-utidylib python-beautifulsoup \
-          python-unidecode libhdf5-dev
+          libhdf5-dev

    (Note that if you use pip to manage your Python dependencies
    instead of operating system packages, please see item (e) below
    on how to use pip instead of aptitude.)

    Moreover, you should install some Message Transfer Agent (MTA)
    such as Postfix so that Invenio can email notification alerts or
    registration information to the end users, contact moderators
    and reviewers of submitted documents, inform administrators about
    various runtime system information, etc:

      $ sudo aptitude install postfix

    After running the above-quoted aptitude command(s), you can
    proceed to configuring your MySQL server instance
    (max_allowed_packet in my.cnf, see item 0b below) and then to
    installing the Invenio software package as described in section 1
    below.

    If you are using another operating system, then please continue
    reading the rest of this prerequisites section, and please
    consult our wiki pages for any concrete hints for your specific
    operating system.

 b) MySQL server (may be on a remote machine), and MySQL client
    (must be available locally too). MySQL versions 4.1 or 5.0 are
    supported. Please set the variable "max_allowed_packet" in your
    "my.cnf" init file to at least 4M. (For sites such as INSPIRE,
    having 1M records with 10M citer-citee pairs in its citation map,
    you may need to increase max_allowed_packet to 1G.) You may
    perhaps also want to run your MySQL server natively in UTF-8 mode
    by setting "default-character-set=utf8" in various parts of your
    "my.cnf" file, such as in the "[mysql]" part and elsewhere; but
    this is not really required. A sketch of such a my.cnf fragment
    is shown below.
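    For example, a minimal illustrative my.cnf fragment along the
    lines discussed above might read (the values are examples only;
    tune them to your site's needs):

      [mysqld]
      max_allowed_packet = 4M

      [mysql]
      default-character-set = utf8

    The [mysqld] part raises the packet limit as required above,
    while the optional [mysql] part switches the client to UTF-8
    mode, which, as noted, is not strictly required.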
 c) Redis server (may be on a remote machine) for user session
    management and caching purposes. By default, Invenio uses Redis
    to store sessions, so installing it is highly recommended.
    However, if you do not want to use Redis, you can change the
    CFG_WEBSESSION_STORAGE setting in invenio-local.conf and MySQL
    will be used for session management instead.

 d) Apache 2 server, with support for loading DSO modules, and
    optionally with SSL support for HTTPS-secure user authentication,
    and mod_xsendfile for off-loading file downloads away from
    Invenio processes to Apache.

 e) Python v2.6 or above, as well as the following Python modules:

      - (mandatory) MySQLdb (version >= 1.2.1_p2; see below)
      - (mandatory) Pyparsing, for document parsing
+     - (mandatory) unidecode, for ASCII representation of Unicode text:
+
      - (recommended) Redis connector:
      - (recommended) Nydus, Redis consistent hashing connector:
      - (recommended) python-dateutil, for complex date processing:
      - (recommended) PyXML, for XML processing:
      - (recommended) PyRXP, for very fast XML MARC processing:
      - (recommended) lxml, for XML/XLST processing:
      - (recommended) Gnuplot.Py, for producing graphs:
      - (recommended) Snowball Stemmer, for stemming:
      - (recommended) py-editdist, for record merging:
      - (recommended) numpy, for citerank methods:
      - (recommended) magic, for full-text file handling:
      - (recommended) cerberus, extensible validation for Python
        dictionaries.
      - (optional) libxml2-python, for XML/XLST processing:
      - (optional) chardet, for character encoding detection:
      - (optional) 4suite, slower alternative to PyRXP and
        libxml2-python:
      - (optional) feedparser, for web journal creation:
      - (optional) RDFLib, to use RDF ontologies and thesauri:
      - (optional) mechanize, to run the regression web test suite:
      - (optional) python-mock, mocking library for the test suite:
      - (optional) utidylib, for HTML washing:
      - (optional) Beautiful Soup, for HTML washing:
      - (optional) Python Twitter (and its dependencies) if you want
        to use the Twitter Fetcher bibtasklet:
      - (optional) Python OpenID if you want to enable OpenID support
        for authentication:
      - (optional) Python Rauth if you want to enable OAuth 1.0/2.0
        support for authentication (depends on Python-2.6 or later):
-     - (optional) unidecode, for ASCII representation of Unicode
-       text:
-
      - (optional) libhdf5-7, libhdf5-dev, python-h5py, in order to
        run author disambiguation.

    Note that if you are using pip to install and manage your Python
    dependencies, then you can run:

      $ sudo pip install -r requirements.txt
      $ sudo pip install -r requirements-extras.txt

    to install all mandatory, recommended, and optional packages
    mentioned above.

 f) mod_wsgi Apache module. Versions 3.x and above are recommended.

 g) If you want to be able to extract references from PDF fulltext
    files, then you need to install at least version 3 of pdftotext.

 h) If you want to be able to search for words in the fulltext files
    (i.e. to have fulltext indexing) or to stamp submitted files,
    then you need to install some of the following tools as well:

      - for Microsoft Office/OpenOffice.org document conversion:
        OpenOffice.org
      - for PDF file stamping: pdftk, pdf2ps
      - for PDF files: pdftotext or pstotext
      - for PostScript files: pstotext or ps2ascii
      - for DjVu creation, elaboration: DjVuLibre
      - to perform OCR: OCRopus (tested only with release 0.3.1)
      - to perform different image elaborations: ImageMagick
      - to generate PDF after OCR: netpbm, ReportLab and pyPdf or
        pyPdf2

 i) If you have chosen to install the fast XML MARC Python processors
    in step e) above, then you have to install the parsers
    themselves:

      - (optional) 4suite:

 j) (recommended) Gnuplot, the command-line driven interactive
    plotting program.
    It is used to display download and citation history graphs on the
    Detailed record pages of the web interface. Note that Gnuplot
    must be compiled with PNG output support, that is, with the GD
    library. Note also that Gnuplot is not required, only
    recommended.

 k) (recommended) A Common Lisp implementation, such as CLISP, SBCL,
    or CMUCL. It is used for the web server log analysing tool and
    the metadata checking program. Note that any of the three
    implementations CLISP, SBCL, or CMUCL will do. CMUCL produces the
    fastest machine code, but it does not support UTF-8 yet. Pick
    CLISP if you don't know which one to choose. Note that a Common
    Lisp implementation is not required, only recommended.

 l) GNU gettext, a set of tools that makes it possible to translate
    the application into multiple languages. This is available by
    default on many systems.

 m) (recommended) xlwt 0.7.2, a library to create spreadsheet files
    compatible with MS Excel 97/2000/XP/2003 XLS files, on any
    platform, with Python 2.3 to 2.6.

 n) (recommended) matplotlib 1.0.0, a Python 2D plotting library
    which produces publication-quality figures in a variety of
    hardcopy formats and interactive environments across platforms.
    matplotlib can be used in Python scripts, the Python and IPython
    shells (a la MATLAB® or Mathematica®), web application servers,
    and six graphical user interface toolkits. It is used to generate
    pie graphs in the custom summary query (WebStat).

 o) (optional) FFmpeg, an open-source collection of tools and
    libraries for converting video and audio files. It makes use of
    both internal and external libraries to generate web-ready video
    formats such as Theora, WebM, and H.264 out of almost any video
    input. FFmpeg is needed to run the video-related modules and
    submission workflows in Invenio. The minimal configuration of
    ffmpeg for the Invenio demo site requires a number of external
    libraries. It is highly recommended to remove all installed
    versions and packages that come with various Linux distributions
    and to install the latest versions from sources. Additionally,
    you will need the MediaInfo library for multimedia metadata
    handling.

    Minimum libraries for the demo site:

      - the ffmpeg multimedia encoder tools
      - a library for JPEG images, needed for thumbnail extraction
      - a library for the OGG container format, needed for Vorbis
        and Theora
      - the OGG Vorbis audio codec library
      - the OGG Theora video codec library
      - the WebM video codec library
      - the MediaInfo library for multimedia metadata

    Recommended for H.264 video (!be aware of licensing issues!):

      - a library for H.264 video encoding
      - a library for Advanced Audio Coding
      - a library for MP3 encoding

Note that the configure script checks whether you have all the
prerequisite software installed, and it won't let you continue unless
everything is in order. It also warns you if it cannot find some
optional but recommended software.

1. Quick instructions for the impatient Invenio admin
=========================================================
1a. Installation
----------------

    $ cd $HOME/src/
    $ wget http://invenio-software.org/download/invenio-1.1.4.tar.gz
    $ wget http://invenio-software.org/download/invenio-1.1.4.tar.gz.md5
    $ wget http://invenio-software.org/download/invenio-1.1.4.tar.gz.sig
    $ md5sum -c invenio-1.1.4.tar.gz.md5
    $ gpg --verify invenio-1.1.4.tar.gz.sig invenio-1.1.4.tar.gz
    $ tar xvfz invenio-1.1.4.tar.gz
    $ cd invenio-1.1.4
    $ ./configure
    $ make
    $ make install
    $ make install-mathjax-plugin     ## optional
    $ make install-jquery-plugins     ## optional
    $ make install-ckeditor-plugin    ## optional
    $ make install-pdfa-helper-files  ## optional
    $ make install-mediaelement       ## optional
    $ make install-solrutils          ## optional
    $ make install-js-test-driver     ## optional

1b. Configuration
-----------------

    $ sudo chown -R www-data.www-data /opt/invenio
    $ sudo -u www-data emacs /opt/invenio/etc/invenio-local.conf
    $ sudo -u www-data /opt/invenio/bin/inveniocfg --update-all
    $ sudo -u www-data /opt/invenio/bin/inveniocfg --create-tables
    $ sudo -u www-data /opt/invenio/bin/inveniocfg --load-bibfield-conf
    $ sudo -u www-data /opt/invenio/bin/inveniocfg --load-webstat-conf
    $ sudo -u www-data /opt/invenio/bin/inveniocfg --create-apache-conf
    $ sudo /etc/init.d/apache2 restart
    $ sudo -u www-data /opt/invenio/bin/inveniocfg --check-openoffice
    $ sudo -u www-data /opt/invenio/bin/inveniocfg --create-demo-site
    $ sudo -u www-data /opt/invenio/bin/inveniocfg --load-demo-records
    $ sudo -u www-data /opt/invenio/bin/inveniocfg --run-unit-tests
    $ sudo -u www-data /opt/invenio/bin/inveniocfg --run-regression-tests
    $ sudo -u www-data /opt/invenio/bin/inveniocfg --run-web-tests
    $ sudo -u www-data /opt/invenio/bin/inveniocfg --remove-demo-records
    $ sudo -u www-data /opt/invenio/bin/inveniocfg --drop-demo-site
    $ firefox http://your.site.com/help/admin/howto-run

2. Detailed instructions for the patient Invenio admin
==========================================================

2a. Installation
----------------

Invenio uses the standard GNU autoconf method to build and install
its files. This means that you proceed as follows:

    $ cd $HOME/src/

      Change to a directory where we will build the Invenio sources.
      (The built files will be installed into different "target"
      directories later.)

    $ wget http://invenio-software.org/download/invenio-1.1.4.tar.gz
    $ wget http://invenio-software.org/download/invenio-1.1.4.tar.gz.md5
    $ wget http://invenio-software.org/download/invenio-1.1.4.tar.gz.sig

      Fetch the Invenio source tarball from the distribution server,
      together with the MD5 checksum and GnuPG cryptographic
      signature files, which are useful for verifying the integrity
      of the tarball.

    $ md5sum -c invenio-1.1.4.tar.gz.md5

      Verify the MD5 checksum.

    $ gpg --verify invenio-1.1.4.tar.gz.sig invenio-1.1.4.tar.gz

      Verify the GnuPG cryptographic signature. Note that you may
      first have to import my public key into your keyring, if you
      haven't done that already:

      $ gpg --keyserver pool.sks-keyservers.net --recv-key 0xBA5A2B67

      The output of the gpg --verify command should then read:

        Good signature from "Tibor Simko"

      You can safely ignore any trusted signature certification
      warning that may follow after the signature has been
      successfully verified.

    $ tar xvfz invenio-1.1.4.tar.gz

      Untar the distribution tarball.

    $ cd invenio-1.1.4

      Go to the source directory.

    $ ./configure

      Configure the Invenio software for building on this specific
      platform. You can use the following optional parameters:

      --prefix=/opt/invenio

        Optionally, specify the Invenio general installation
        directory (default is /opt/invenio).
        It will contain command-line binaries and program libraries
        containing the core Invenio functionality, but also store web
        pages, runtime log and cache information, document data
        files, etc. Several subdirs like `bin', `etc', `lib', or
        `var' will be created inside the prefix directory to this
        effect. Note that the prefix directory should be chosen
        outside of the Apache htdocs tree, since only one of its
        subdirectories (prefix/var/www) is to be accessible directly
        via the Web (see below). Note that Invenio won't install to
        any other directory but to the prefix mentioned in this
        configuration line.

      --with-python=/opt/python/bin/python2.7

        Optionally, specify a path to some specific Python binary.
        This is useful if you have more than one Python installation
        on your system. If you don't set this option, then the first
        Python found in your PATH will be chosen for running Invenio.

      --with-mysql=/opt/mysql/bin/mysql

        Optionally, specify a path to some specific MySQL client
        binary. This is useful if you have more than one MySQL
        installation on your system. If you don't set this option,
        then the first MySQL client executable found in your PATH
        will be chosen for running Invenio.

      --with-clisp=/opt/clisp/bin/clisp

        Optionally, specify a path to the CLISP executable. This is
        useful if you have more than one CLISP installation on your
        system. If you don't set this option, then the first
        executable found in your PATH will be chosen for running
        Invenio.

      --with-cmucl=/opt/cmucl/bin/lisp

        Optionally, specify a path to the CMUCL executable. This is
        useful if you have more than one CMUCL installation on your
        system. If you don't set this option, then the first
        executable found in your PATH will be chosen for running
        Invenio.

      --with-sbcl=/opt/sbcl/bin/sbcl

        Optionally, specify a path to the SBCL executable. This is
        useful if you have more than one SBCL installation on your
        system. If you don't set this option, then the first
        executable found in your PATH will be chosen for running
        Invenio.

      --with-openoffice-python

        Optionally, specify the path to the Python interpreter
        embedded with OpenOffice.org. This is normally not on the
        standard path. If you don't specify this, it won't be
        possible to use OpenOffice.org to convert between Microsoft
        Office and OpenOffice.org documents.

      This configuration step is mandatory. Usually, you do this step
      only once.

      (Note that if you are building Invenio not from a released
      tarball, but from the Git sources, then you have to generate
      the configure file via autotools:

        $ sudo aptitude install automake1.9 autoconf
        $ aclocal-1.9
        $ automake-1.9 -a
        $ autoconf

      after which you proceed with the usual configure command.)

    $ make

      Launch the Invenio build. Since many messages are printed
      during the build process, you may want to run it in a
      fast-scrolling terminal such as rxvt or in a detached screen
      session. During this step all the pages and scripts will be
      pre-created and customized based on the config you have edited
      in the previous step. Note that on systems such as FreeBSD or
      Mac OS X you have to use GNU make ("gmake") instead of "make".

    $ make install

      Install the web pages, scripts, utilities and everything needed
      for the Invenio runtime into the respective installation
      directories, as specified earlier by the configure command.
      Note that if you are installing Invenio for the first time, you
      will be asked to create symbolic link(s) from Python's
      site-packages system-wide directory(ies) to the installation
      location.
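      For illustration, such a symbolic link might look similar to
      the following hypothetical sketch (the actual source and target
      paths depend on your configure prefix and on your Python
      version's site-packages location; use the exact command that
      "make install" prints for you):

        $ sudo ln -s /opt/invenio/lib/python/invenio \
               /usr/lib/python2.6/dist-packages/invenio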
      The link instructs Python where to find Invenio's Python files.
      The exact command to use will be suggested to you, based on the
      parameters you have used in the configure command.

    $ make install-mathjax-plugin    ## optional

      This will automatically download and install in the proper
      place MathJax, a JavaScript library to render LaTeX formulas in
      the client browser. Note that in order to enable the rendering
      you will have to set the variable
      CFG_WEBSEARCH_USE_MATHJAX_FOR_FORMATS in invenio-local.conf to
      a suitable list of output format codes. For example:

        CFG_WEBSEARCH_USE_MATHJAX_FOR_FORMATS = hd,hb

    $ make install-jquery-plugins    ## optional

      This will automatically download and install in the proper
      place jQuery and related plugins. They are used for AJAX
      applications such as the record editor. Note that `unzip' is
      needed when installing the jQuery plugins.

    $ make install-ckeditor-plugin    ## optional

      This will automatically download and install in the proper
      place CKeditor, a WYSIWYG JavaScript-based editor (e.g. for the
      WebComment module). Note that in order to enable the editor you
      have to set CFG_WEBCOMMENT_USE_RICH_EDITOR to True.

    $ make install-pdfa-helper-files    ## optional

      This will automatically download and install in the proper
      place the helper files needed to create PDF/A files out of
      existing PDF files.

    $ make install-mediaelement    ## optional

      This will automatically download and install the MediaElementJS
      HTML5 video player that is needed for videos on the DEMO site.

    $ make install-solrutils    ## optional

      This will automatically download and install a Solr instance
      which can be used for full-text searching. See the CFG_SOLR_URL
      variable in invenio.conf. Note that the admin later has to take
      care of running the init.d scripts which would start the Solr
      instance automatically.

    $ make install-js-test-driver    ## optional

      This will automatically download and install JsTestDriver,
      which is needed to run JS unit tests. Recommended for
      developers.

2b. Configuration
-----------------

Once the basic software installation is done, we proceed to
configuring your Invenio system.

    $ sudo chown -R www-data.www-data /opt/invenio

      For the sake of simplicity, let us assume that your Invenio
      installation will run under the `www-data' user process
      identity. The above command changes ownership of the installed
      files to www-data, so that we shall run everything under this
      user identity from now on.

      For production purposes, you would typically enable the Apache
      server to read all files from the installation place but to
      write only to the `var' subdirectory of your installation
      place. You could achieve this by configuring Unix directory
      group permissions, for example.

    $ sudo -u www-data emacs /opt/invenio/etc/invenio-local.conf

      Customize your Invenio installation. Please read the
      'invenio.conf' file located in the same directory: it contains
      the vanilla default configuration parameters of your Invenio
      installation. If you want to customize some of these
      parameters, you should create a file named 'invenio-local.conf'
      in the same directory where 'invenio.conf' lives, and you
      should write there only the customizations that you want to be
      different from the vanilla defaults.
      Here is a realistic, minimalist, yet production-ready example
      of what you would typically put there:

        $ cat /opt/invenio/etc/invenio-local.conf
        [Invenio]
        CFG_SITE_NAME = John Doe's Document Server
        CFG_SITE_NAME_INTL_fr = Serveur des Documents de John Doe
        CFG_SITE_URL = http://your.site.com
        CFG_SITE_SECURE_URL = https://your.site.com
        CFG_SITE_ADMIN_EMAIL = john.doe@your.site.com
        CFG_SITE_SUPPORT_EMAIL = john.doe@your.site.com
        CFG_WEBALERT_ALERT_ENGINE_EMAIL = john.doe@your.site.com
        CFG_WEBCOMMENT_ALERT_ENGINE_EMAIL = john.doe@your.site.com
        CFG_WEBCOMMENT_DEFAULT_MODERATOR = john.doe@your.site.com
        CFG_BIBAUTHORID_AUTHOR_TICKET_ADMIN_EMAIL = john.doe@your.site.com
        CFG_BIBCATALOG_SYSTEM_EMAIL_ADDRESS = john.doe@your.site.com
        CFG_DATABASE_HOST = localhost
        CFG_DATABASE_NAME = invenio
        CFG_DATABASE_USER = invenio
        CFG_DATABASE_PASS = my123p$ss
        CFG_BIBDOCFILE_ENABLE_BIBDOCFSINFO_CACHE = 1

      You should override at least the parameters mentioned above in
      order to define some very essential runtime parameters such as
      the name of your document server (CFG_SITE_NAME and
      CFG_SITE_NAME_INTL_*), the visible URL of your document server
      (CFG_SITE_URL and CFG_SITE_SECURE_URL), the email addresses of
      the local Invenio administrator, comment moderator, and alert
      engine (CFG_SITE_SUPPORT_EMAIL, CFG_SITE_ADMIN_EMAIL, etc), and
      last but not least your database credentials (CFG_DATABASE_*).

      If this is a first installation of Invenio, it is recommended
      that you set the CFG_BIBDOCFILE_ENABLE_BIBDOCFSINFO_CACHE
      variable to 1. If this is instead an upgrade from an existing
      installation, don't add it until you have run:

        $ bibdocfile --fix-bibdocfsinfo-cache

      The Invenio system will then read both the default invenio.conf
      file and your customized invenio-local.conf file, and it will
      override any default options with the ones you have specified
      in your local file. This cascading of configuration parameters
      will ease your future upgrades.

      If you want to have multiple Invenio instances for distributed
      video encoding, you need to share the same configuration among
      them and make some of the folders of the Invenio installation
      available to all nodes. Configure the allowed tasks for every
      node:

        CFG_BIBSCHED_NODE_TASKS = {
          "hostname_machine1" : ["bibindex", "bibupload",
                                 "bibreformat", "webcoll",
                                 "bibtaskex", "bibrank", "oaiharvest",
                                 "oairepositoryupdater", "inveniogc",
                                 "webstatadmin", "bibclassify",
                                 "bibexport", "dbdump",
                                 "batchuploader", "bibauthorid",
                                 "bibtasklet"],
          "hostname_machine2" : ['bibencode',]
        }

      Share the following directories among the Invenio instances:

        /var/tmp-shared
           hosts video uploads in a temporary form

        /var/tmp-shared/bibencode/jobs
           hosts new job files for the video encoding daemon

        /var/tmp-shared/bibencode/jobs/done
           hosts job files that have been processed by the daemon

        /var/data/files
           hosts fulltext and media files associated with records

        /var/data/submit
           hosts files created during submissions

    $ sudo -u www-data /opt/invenio/bin/inveniocfg --update-all

      Make the rest of the Invenio system aware of your
      invenio-local.conf changes. This step is mandatory each time
      you edit your conf files.

    $ sudo -u www-data /opt/invenio/bin/inveniocfg --create-tables

      If you are installing Invenio for the first time, you have to
      create the database tables. Note that this step checks for
      potential problems such as the database connection rights and
      may ask you to perform some more administrative steps in case
      it detects a problem. Notably, it may ask you to set up
      database access permissions, based on your configure values;
      see the sketch below.
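      To give an idea, a typical first-time database setup amounts to
      statements like the following sketch (illustrative only: the
      database name, user, and password must match your
      CFG_DATABASE_* values, and the authoritative commands are the
      ones this step prints for you):

        $ mysql -h localhost -u root -p
        mysql> CREATE DATABASE invenio DEFAULT CHARACTER SET utf8;
        mysql> GRANT ALL PRIVILEGES ON invenio.*
                  TO 'invenio'@'localhost' IDENTIFIED BY 'my123p$ss';
        mysql> FLUSH PRIVILEGES;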
      Indeed, if you are installing Invenio for the first time, you
      have to create a dedicated database on your MySQL server that
      Invenio can use for its purposes. Please contact your MySQL
      administrator and ask them to execute the commands this step
      proposes to you.

      At this point you should now have successfully completed the
      "make install" process. We continue by setting up the Apache
      web server.

    $ sudo -u www-data /opt/invenio/bin/inveniocfg --load-bibfield-conf

      Load the configuration file of the BibField module. It will
      create the `bibfield_config.py' file. (FIXME: When BibField
      becomes an essential part of Invenio, this step should be
      automated so that people do not have to run it manually.)

    $ sudo -u www-data /opt/invenio/bin/inveniocfg --load-webstat-conf

      Load the configuration file of the WebStat module. It will
      create the tables in the database for registering custom
      events, such as basket hits.

    $ sudo -u www-data /opt/invenio/bin/inveniocfg --create-apache-conf

      Running this command will generate Apache virtual host
      configurations matching your installation. You will be
      instructed to check the created files (usually they are located
      under /opt/invenio/etc/apache/) and to edit your httpd.conf to
      activate the Invenio virtual hosts.

      If you are using Debian GNU/Linux ``Lenny'' or later, then you
      can do the following to create your SSL certificate and to
      activate your Invenio vhosts:

        ## make SSL certificate:
        $ sudo aptitude install ssl-cert
        $ sudo mkdir /etc/apache2/ssl
        $ sudo /usr/sbin/make-ssl-cert /usr/share/ssl-cert/ssleay.cnf \
               /etc/apache2/ssl/apache.pem

        ## add Invenio web sites:
        $ sudo ln -s /opt/invenio/etc/apache/invenio-apache-vhost.conf \
               /etc/apache2/sites-available/invenio
        $ sudo ln -s /opt/invenio/etc/apache/invenio-apache-vhost-ssl.conf \
               /etc/apache2/sites-available/invenio-ssl

        ## disable Debian's default web site:
        $ sudo /usr/sbin/a2dissite default

        ## enable Invenio web sites:
        $ sudo /usr/sbin/a2ensite invenio
        $ sudo /usr/sbin/a2ensite invenio-ssl

        ## enable SSL module:
        $ sudo /usr/sbin/a2enmod ssl

        ## if you are using the xsendfile module, enable it too:
        $ sudo /usr/sbin/a2enmod xsendfile

      If you are using another operating system, you should do the
      equivalent, for example edit your system-wide httpd.conf and
      put there the following include statements:

        Include /opt/invenio/etc/apache/invenio-apache-vhost.conf
        Include /opt/invenio/etc/apache/invenio-apache-vhost-ssl.conf

      Note that you may need to adapt the generated vhost file
      snippets to match your concrete operating system specifics. For
      example, the generated configuration snippet will preload the
      Invenio WSGI daemon application upon Apache start-up for faster
      site response. The generated configuration assumes that you are
      using mod_wsgi version 3 or later. If you are using the old
      legacy mod_wsgi version 2, then you would need to comment out
      the WSGIImportScript directive from the generated snippet, or
      else move the WSGI daemon setup to the top level, outside of
      the VirtualHost section.

      Note also that you may want to tweak the generated Apache vhost
      snippet for performance reasons, especially with respect to the
      WSGIDaemonProcess parameters. For example, you can increase the
      number of processes from the default value `processes=5' if you
      have lots of RAM and if many concurrent users may access your
      site in parallel. However, note that you must use `threads=1'
      there, because the Invenio WSGI daemon processes are not fully
      thread safe yet. This may change in the future. A sketch of
      such a tweak follows below.
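      For instance, the tuned directive in the generated vhost file
      might look like the following sketch (WSGIDaemonProcess and its
      processes/threads options are standard mod_wsgi directives; any
      other options present in your generated snippet should be kept
      as generated):

        ## generated default:
        WSGIDaemonProcess invenio processes=5 threads=1 ...
        ## on a machine with plenty of RAM and many concurrent users:
        WSGIDaemonProcess invenio processes=10 threads=1 ...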
    $ sudo /etc/init.d/apache2 restart

      Please ask your webserver administrator to restart the Apache
      server after the above "httpd.conf" changes.

    $ sudo -u www-data /opt/invenio/bin/inveniocfg --check-openoffice

      If you plan to support MS Office or Open Document Format files
      in your installation, you should check whether LibreOffice or
      OpenOffice.org is well integrated with Invenio by running the
      above command. You may be asked to create a temporary directory
      for converting office files, with special ownership (typically
      as user nobody) and permissions. Note that you can do this step
      later.

    $ sudo -u www-data /opt/invenio/bin/inveniocfg --create-demo-site

      This step is recommended to test your local Invenio
      installation. It should give you our "Atlantis Institute of
      Science" demo installation, exactly as you see it on the public
      Invenio demo site.

    $ sudo -u www-data /opt/invenio/bin/inveniocfg --load-demo-records

      Optionally, load some demo records to be able to test indexing
      and searching of your local Invenio demo installation.

    $ sudo -u www-data /opt/invenio/bin/inveniocfg --run-unit-tests

      Optionally, you can run the unit test suite to verify the unit
      behaviour of your local Invenio installation. Note that this
      command should be run only after you have installed the whole
      system via `make install'.

    $ sudo -u www-data /opt/invenio/bin/inveniocfg --run-regression-tests

      Optionally, you can run the full regression test suite to
      verify the functional behaviour of your local Invenio
      installation. Note that this command requires the demo site to
      have been created and the demo records to have been loaded.
      Note also that running the regression test suite may alter the
      database content with junk data, so rebuilding the demo site is
      strongly recommended afterwards.

    $ sudo -u www-data /opt/invenio/bin/inveniocfg --run-web-tests

      Optionally, you can run additional automated web tests running
      in a real browser. This requires Firefox with the Selenium IDE
      extension installed.

    $ sudo -u www-data /opt/invenio/bin/inveniocfg --remove-demo-records

      Optionally, remove the demo records loaded in the previous
      step, while otherwise keeping the demo collection, submission,
      format, and other configurations that you may reuse and modify
      for your own production purposes.

    $ sudo -u www-data /opt/invenio/bin/inveniocfg --drop-demo-site

      Optionally, also drop all the demo configuration so that you'll
      end up with a completely blank Invenio system. However, you may
      find it more practical not to drop the demo site configuration
      but to start customizing from there.

    $ firefox http://your.site.com/help/admin/howto-run

      In order to start using your Invenio installation, you can
      start the indexing, formatting and other daemons as indicated
      in the "HOWTO Run" guide at the above URL. You can also use the
      Admin Area web interfaces to perform further runtime
      configurations such as the definition of data collections,
      document types, document formats, word indexes, etc.

    $ sudo ln -s /opt/invenio/etc/bash_completion.d/inveniocfg \
           /etc/bash_completion.d/inveniocfg

      Optionally, if you are using Bash shell completion, then you
      may want to create the above symlink in order to configure
      completion for the inveniocfg command.

Good luck, and thanks for choosing Invenio.
       - Invenio Development Team
         Email: info@invenio-software.org
         IRC: #invenio on irc.freenode.net
         Twitter: http://twitter.com/inveniosoftware
         Github: http://github.com/inveniosoftware
         URL: http://invenio-software.org

diff --git a/configure-tests.py b/configure-tests.py
index 4793988af..07f1c4ec3 100644
--- a/configure-tests.py
+++ b/configure-tests.py
@@ -1,532 +1,513 @@
## This file is part of Invenio.
## Copyright (C) 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.

"""
Test the suitability of the Python core and the availability of
various Python modules for running Invenio. Warn the user about any
potential problems. Exit status: 0 if okay, 1 if not okay. Useful
for running from configure.ac.
"""

## minimally recommended/required versions:
cfg_min_python_version = "2.6"
cfg_max_python_version = "2.9.9999"
cfg_min_mysqldb_version = "1.2.1_p2"

## 0) import modules needed for this testing:
import string
import sys
import getpass
import subprocess
import re

error_messages = []
warning_messages = []

def wait_for_user(msg):
    """Print MSG and prompt user for confirmation."""
    try:
        raw_input(msg)
    except KeyboardInterrupt:
        print "\n\nInstallation aborted."
        sys.exit(1)
    except EOFError:
        print " (continuing in batch mode)"
        return

## 1) check Python version:
if sys.version < cfg_min_python_version:
    error_messages.append(
    """
    *******************************************************
    ** ERROR: TOO OLD PYTHON DETECTED: %s
    *******************************************************
    ** You seem to be using a too old version of Python. **
    ** You must use at least Python %s.                  **
    **                                                   **
    ** Note that if you have more than one Python        **
    ** installed on your system, you can specify the     **
    ** --with-python configuration option to choose      **
    ** a specific (e.g. non system wide) Python binary.  **
    **                                                   **
    ** Please upgrade your Python before continuing.     **
    *******************************************************
    """ % (string.replace(sys.version, "\n", ""), cfg_min_python_version)
    )

if sys.version > cfg_max_python_version:
    error_messages.append(
    """
    *******************************************************
    ** ERROR: TOO NEW PYTHON DETECTED: %s
    *******************************************************
    ** You seem to be using a too new version of Python. **
    ** You must use at most Python %s.                   **
    **                                                   **
    ** Perhaps you have downloaded and are installing an **
    ** old Invenio version? Please look for a more       **
    ** recent Invenio version or please contact the      **
    ** development team about this problem.              **
    **                                                   **
    ** Installation aborted.
** ******************************************************* """ % (string.replace(sys.version, "\n", ""), cfg_max_python_version) ) ## 2) check for required modules: try: import MySQLdb import base64 import cPickle import cStringIO import cgi import copy import fileinput import getopt import sys if sys.hexversion < 0x2060000: import md5 else: import hashlib import marshal import os import pyparsing import signal import tempfile import time import traceback import unicodedata import urllib import zlib import wsgiref + import unidecode except ImportError, msg: error_messages.append(""" ************************************************* ** IMPORT ERROR %s ************************************************* ** Perhaps you forgot to install some of the ** ** prerequisite Python modules? Please look ** ** at our INSTALL file for more details and ** ** fix the problem before continuing! ** ************************************************* """ % msg ) ## 3) check for recommended modules: try: import rdflib except ImportError, msg: warning_messages.append( """ ***************************************************** ** IMPORT WARNING %s ***************************************************** ** Note that rdflib is needed only if you plan ** ** to work with the automatic classification of ** ** documents based on RDF-based taxonomies. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) ** ***************************************************** """ % msg ) try: import pyRXP except ImportError, msg: warning_messages.append(""" ***************************************************** ** IMPORT WARNING %s ***************************************************** ** Note that PyRXP is not really required but ** ** we recommend it for fast XML MARC parsing. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) ** ***************************************************** """ % msg ) try: import dateutil except ImportError, msg: warning_messages.append(""" ***************************************************** ** IMPORT WARNING %s ***************************************************** ** Note that dateutil is not really required but ** ** we recommend it for user-friendly date ** ** parsing. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) ** ***************************************************** """ % msg ) try: import libxml2 except ImportError, msg: warning_messages.append(""" ***************************************************** ** IMPORT WARNING %s ***************************************************** ** Note that libxml2 is not really required but ** ** we recommend it for XML metadata conversions ** ** and for fast XML parsing. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) 
** ***************************************************** """ % msg ) try: import libxslt except ImportError, msg: warning_messages.append( """ ***************************************************** ** IMPORT WARNING %s ***************************************************** ** Note that libxslt is not really required but ** ** we recommend it for XML metadata conversions. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) ** ***************************************************** """ % msg ) try: import Gnuplot except ImportError, msg: warning_messages.append( """ ***************************************************** ** IMPORT WARNING %s ***************************************************** ** Note that Gnuplot.py is not really required but ** ** we recommend it in order to have nice download ** ** and citation history graphs on Detailed record ** ** pages. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) ** ***************************************************** """ % msg ) try: import rauth except ImportError, msg: warning_messages.append( """ ***************************************************** ** IMPORT WARNING %s ***************************************************** ** Note that python-rauth is not really required ** ** but we recommend it in order to enable oauth ** ** based authentication. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) ** ***************************************************** """ % msg ) try: import openid except ImportError, msg: warning_messages.append( """ ***************************************************** ** IMPORT WARNING %s ***************************************************** ** Note that python-openid is not really required ** ** but we recommend it in order to enable OpenID ** ** based authentication. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) ** ***************************************************** """ % msg ) try: import magic if not hasattr(magic, "open"): raise StandardError except ImportError, msg: warning_messages.append( """ ***************************************************** ** IMPORT WARNING %s ***************************************************** ** Note that magic module is not really required ** ** but we recommend it in order to have detailed ** ** content information about fulltext files. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. ** ** even after your Invenio installation is put ** ** into production.) ** ***************************************************** """ % msg ) except StandardError: warning_messages.append( """ ***************************************************** ** IMPORT WARNING python-magic ***************************************************** ** The python-magic package you installed is not ** ** the one supported by Invenio. Please refer to ** ** the INSTALL file for more details. ** ** ** ** You can safely continue installing Invenio ** ** now, and add this module anytime later. (I.e. 
                                                      **
    ** even after your Invenio installation is put    **
    ** into production.)                              **
    *****************************************************
    """
    )

try:
    import reportlab
except ImportError, msg:
    warning_messages.append(
    """
    *****************************************************
    ** IMPORT WARNING %s
    *****************************************************
    ** Note that the reportlab module is not really    **
    ** required, but we recommend it if you want to    **
    ** enrich PDF with OCR information.                **
    **                                                 **
    ** You can safely continue installing Invenio     **
    ** now, and add this module anytime later. (I.e.  **
    ** even after your Invenio installation is put    **
    ** into production.)                              **
    *****************************************************
    """ % msg
    )

try:
    try:
        import PyPDF2
    except ImportError:
        import pyPdf
except ImportError, msg:
    warning_messages.append(
    """
    *****************************************************
    ** IMPORT WARNING %s
    *****************************************************
    ** Note that the pyPdf or pyPdf2 module is not     **
    ** really required, but we recommend it if you     **
    ** want to enrich PDF with OCR information.        **
    **                                                 **
    ** You can safely continue installing Invenio     **
    ** now, and add this module anytime later. (I.e.  **
    ** even after your Invenio installation is put    **
    ** into production.)                              **
    *****************************************************
    """ % msg
    )

-try:
-    import unidecode
-except ImportError, msg:
-    warning_messages.append(
-        """
-        *****************************************************
-        ** IMPORT WARNING %s
-        *****************************************************
-        ** Note that unidecode module is not really        **
-        ** required, but we recommend it you want to       **
-        ** introduce smarter author names matching.        **
-        **                                                 **
-        ** You can safely continue installing Invenio      **
-        ** now, and add this module anytime later. (I.e.   **
-        ** even after your Invenio installation is put     **
-        ** into production.)                               **
-        *****************************************************
-        """ % msg
-        )
-

## 4) check for versions of some important modules:
if MySQLdb.__version__ < cfg_min_mysqldb_version:
    error_messages.append(
    """
    *****************************************************
    ** ERROR: PYTHON MODULE MYSQLDB %s DETECTED
    *****************************************************
    ** You have to upgrade your MySQLdb to at least    **
    ** version %s. You must fix this problem           **
    ** before continuing. Please see the INSTALL file  **
    ** for more details.                               **
    *****************************************************
    """ % (MySQLdb.__version__, cfg_min_mysqldb_version)
    )

try:
    import Stemmer
    try:
        from Stemmer import algorithms
    except ImportError, msg:
        error_messages.append(
        """
        *****************************************************
        ** ERROR: STEMMER MODULE PROBLEM %s
        *****************************************************
        ** Perhaps you are using an old Stemmer version?   **
        ** You must either remove your old Stemmer or else **
        ** upgrade to Snowball Stemmer                     **
        ** before continuing. Please see the INSTALL file  **
        ** for more details.
                                                           **
        *****************************************************
        """ % (msg)
        )
except ImportError:
    pass # no prob, Stemmer is optional

## 5) check for Python.h (needed for intbitset):
try:
    from distutils.sysconfig import get_python_inc
    path_to_python_h = get_python_inc() + os.sep + 'Python.h'
    if not os.path.exists(path_to_python_h):
        raise StandardError, "Cannot find %s" % path_to_python_h
except StandardError, msg:
    error_messages.append(
    """
    *****************************************************
    ** ERROR: PYTHON HEADER FILE ERROR %s
    *****************************************************
    ** You do not seem to have Python developer files  **
    ** installed (such as Python.h). Some operating    **
    ** systems provide these in a separate Python      **
    ** package called python-dev or python-devel.      **
    ** You must install such a package before          **
    ** continuing the installation process.            **
    *****************************************************
    """ % (msg)
    )

## 6) check if ffmpeg is installed and, if so, whether it has the
##    minimum configuration for bibencode:
try:
    try:
        process = subprocess.Popen('ffprobe', stderr=subprocess.PIPE, stdout=subprocess.PIPE)
    except OSError:
        raise StandardError, "FFMPEG/FFPROBE does not seem to be installed!"
    returncode = process.wait()
    output = process.communicate()[1]
    RE_CONFIGURATION = re.compile("(--enable-[a-z0-9\-]*)")
    CONFIGURATION_REQUIRED = (
        '--enable-gpl',
        '--enable-version3',
        '--enable-nonfree',
        '--enable-libtheora',
        '--enable-libvorbis',
        '--enable-libvpx',
        '--enable-libopenjpeg'
    )
    options = RE_CONFIGURATION.findall(output)
    if sys.version_info < (2, 6):
        import sets
        s = sets.Set(CONFIGURATION_REQUIRED)
        if not s.issubset(options):
            raise StandardError, s.difference(options)
    else:
        if not set(CONFIGURATION_REQUIRED).issubset(options):
            raise StandardError, set(CONFIGURATION_REQUIRED).difference(options)
except StandardError, msg:
    warning_messages.append(
    """
    *****************************************************
    ** WARNING: FFMPEG CONFIGURATION MISSING %s
    *****************************************************
    ** You do not seem to have FFmpeg configured with  **
    ** the minimum video codecs to run the demo site.  **
    ** Please install the necessary libraries and      **
    ** re-install FFmpeg according to the Invenio      **
    ** installation manual (INSTALL).                  **
    *****************************************************
    """ % (msg)
    )

if warning_messages:
    print """
    ******************************************************
    ** WARNING MESSAGES                                 **
    ******************************************************
    """
    for warning in warning_messages:
        print warning

if error_messages:
    print """
    ******************************************************
    ** ERROR MESSAGES                                   **
    ******************************************************
    """
    for error in error_messages:
        print error

if warning_messages and error_messages:
    print """
 There were %(n_err)s error(s) found that you need to solve.
 Please see above, solve them, and re-run configure.
 Note that there are also %(n_wrn)s warnings you may want to
 look into. Aborting the installation.
 """ % {'n_wrn': len(warning_messages), 'n_err': len(error_messages)}
    sys.exit(1)
elif error_messages:
    print """
 There were %(n_err)s error(s) found that you need to solve.
 Please see above, solve them, and re-run configure.
 Aborting the installation.
 """ % {'n_err': len(error_messages)}
    sys.exit(1)
elif warning_messages:
    print """
 There were %(n_wrn)s warnings found that you may want to look into,
 solve, and re-run configure before you continue the installation.
However, you can also continue the installation now and solve these issues later, if you wish. """ % {'n_wrn': len(warning_messages)} diff --git a/modules/bibsort/lib/bibsort_washer.py b/modules/bibsort/lib/bibsort_washer.py index 73938e2f0..52fecdbe2 100644 --- a/modules/bibsort/lib/bibsort_washer.py +++ b/modules/bibsort/lib/bibsort_washer.py @@ -1,133 +1,142 @@ ## -*- mode: python; coding: utf-8; -*- ## ## This file is part of Invenio. ## Copyright (C) 2010, 2011, 2012 CERN. ## ## Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. """Applies a transformation function to a value""" -from time import strptime -from invenio.dateutils import strftime -from invenio.textutils import strip_accents +import re +from invenio.dateutils import strftime, strptime +from invenio.textutils import decode_to_unicode, translate_to_ascii LEADING_ARTICLES = ['the', 'a', 'an', 'at', 'on', 'of'] +_RE_NOSYMBOLS = re.compile("\w+") class InvenioBibSortWasherNotImplementedError(Exception): """Exception raised when a washer method defined in the bibsort config file is not implemented""" pass class BibSortWasher(object): """Implements all the washer methods""" def __init__(self, washer): self.washer = washer fnc_name = '_' + washer try: self.washer_fnc = self.__getattribute__(fnc_name) except AttributeError, err: raise InvenioBibSortWasherNotImplementedError(err) def get_washer(self): """Returns the washer name""" return self.washer def get_transformed_value(self, val): """Returns the value""" return self.washer_fnc(val) def _sort_alphanumerically_remove_leading_articles_strip_accents(self, val): """ Convert: 'The title' => 'title' 'A title' => 'title' 'Title' => 'title' """ if not val: return '' - val_tokens = str(val).split(" ", 1) #split in leading_word, phrase_without_leading_word - if len(val_tokens) == 2 and val_tokens[0].lower() in LEADING_ARTICLES: - return strip_accents(val_tokens[1].strip().lower()) - return strip_accents(val.lower()) + val = translate_to_ascii(val).pop().lower() + val_tokens = val.split(" ", 1) #split in leading_word, phrase_without_leading_word + if len(val_tokens) == 2 and val_tokens[0].strip() in LEADING_ARTICLES: + return val_tokens[1].strip() + return val.strip() def _sort_alphanumerically_remove_leading_articles(self, val): """ Convert: 'The title' => 'title' 'A title' => 'title' 'Title' => 'title' """ if not val: return '' - val_tokens = str(val).split(" ", 1) #split in leading_word, phrase_without_leading_word - if len(val_tokens) == 2 and val_tokens[0].lower() in LEADING_ARTICLES: - return val_tokens[1].strip().lower() - return val.lower() + val = decode_to_unicode(val).lower().encode('UTF-8') + val_tokens = val.split(" ", 1) #split in leading_word, phrase_without_leading_word + if len(val_tokens) == 2 and val_tokens[0].strip() in LEADING_ARTICLES: + return val_tokens[1].strip() + return val.strip() def _sort_case_insensitive_strip_accents(self, 
val): """Remove accents and convert to lower case""" if not val: return '' - return strip_accents(str(val).lower()) + return translate_to_ascii(val).pop().lower() + + def _sort_nosymbols_case_insensitive_strip_accents(self, val): + """Remove accents, remove symbols, and convert to lower case""" + if not val: + return '' + return ''.join(_RE_NOSYMBOLS.findall(translate_to_ascii(val).pop().lower())) def _sort_case_insensitive(self, val): """Conversion to lower case""" if not val: return '' - return str(val).lower() + return decode_to_unicode(val).lower().encode('UTF-8') def _sort_dates(self, val): """ Convert: '8 nov 2010' => '2010-11-08' 'nov 2010' => '2010-11-01' '2010' => '2010-01-01' """ datetext_format = "%Y-%m-%d" try: datestruct = strptime(val, datetext_format) except ValueError: try: datestruct = strptime(val, "%d %b %Y") except ValueError: try: datestruct = strptime(val, "%b %Y") except ValueError: try: datestruct = strptime(val, "%Y") except ValueError: return val return strftime(datetext_format, datestruct) def _sort_numerically(self, val): """ Convert: 1245 => float(1245) """ try: return float(val) except ValueError: return 0 def get_all_available_washers(): """ Returns all the available washer functions without the leading '_' """ method_list = dir(BibSortWasher) return [method[1:] for method in method_list if method.startswith('_') and method.find('__') < 0] diff --git a/modules/bibsort/lib/bibsort_washer_unit_tests.py b/modules/bibsort/lib/bibsort_washer_unit_tests.py index 8dd94152a..a51cf4f0a 100644 --- a/modules/bibsort/lib/bibsort_washer_unit_tests.py +++ b/modules/bibsort/lib/bibsort_washer_unit_tests.py @@ -1,64 +1,72 @@ ## -*- mode: python; coding: utf-8; -*- ## ## This file is part of Invenio. ## Copyright (C) 2010, 2011, 2012 CERN. ## ## Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. 
"""Testing module for BibSort Method Treatment""" from invenio.testutils import InvenioTestCase from invenio.bibsort_washer import BibSortWasher from invenio.testutils import make_test_suite, run_test_suite class TestBibSortWasherCreation(InvenioTestCase): """Test BibSortWasher Creation.""" def test_method_creation(self): """Tests the creation of a method""" method = 'sort_alphanumerically_remove_leading_articles' bsm = BibSortWasher(method) self.assertEqual(bsm.get_washer(), method) class TestBibSortWasherWashers(InvenioTestCase): """Test BibSortWasher Washers.""" def test_sort_alphanumerically_remove_leading_articles(self): """Test the sort_alphanumerically_remove_leading_articles method""" method = "sort_alphanumerically_remove_leading_articles" bsm = BibSortWasher(method) self.assertEqual('title of a record', bsm.get_transformed_value('The title of a record')) self.assertEqual('title of a record', bsm.get_transformed_value('a title of a record')) self.assertEqual('the', bsm.get_transformed_value('The')) def test_sort_dates(self): """Test the sort_dates method""" method = "sort_dates" bsm = BibSortWasher(method) self.assertEqual('2010-01-10', bsm.get_transformed_value('2010-01-10')) self.assertEqual('2010-11-10', bsm.get_transformed_value('10 nov 2010')) self.assertEqual('2010-11-01', bsm.get_transformed_value('nov 2010')) self.assertEqual('2010-01-01', bsm.get_transformed_value('2010')) self.assertEqual('2010-11-08', bsm.get_transformed_value('8 nov 2010')) + def test_sort_nosymbols_case_insensitive_strip_accents(self): + """Test the sort_nosymbols_case_insensitive_strip_accents method""" + method = "sort_nosymbols_case_insensitive_strip_accents" + bsm = BibSortWasher(method) + self.assertEqual("thooftgerardus", bsm.get_transformed_value("'t Hooft, Gerardus")) + self.assertEqual("ahearnmichaelf", bsm.get_transformed_value("A'Hearn, Michael F.")) + self.assertEqual("zvolskymilan", bsm.get_transformed_value("Zvolský, Milan")) + TEST_SUITE = make_test_suite(TestBibSortWasherWashers, TestBibSortWasherCreation) if __name__ == "__main__": run_test_suite(TEST_SUITE) diff --git a/modules/docextract/lib/refextract_tag.py b/modules/docextract/lib/refextract_tag.py index e93cff128..389700a8d 100644 --- a/modules/docextract/lib/refextract_tag.py +++ b/modules/docextract/lib/refextract_tag.py @@ -1,1415 +1,1410 @@ # -*- coding: utf-8 -*- ## ## This file is part of Invenio. ## Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010, 2011 CERN. ## ## Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. 
import re

-try:
-    from unidecode import unidecode
-    UNIDECODE_AVAILABLE = True
-except ImportError:
-    UNIDECODE_AVAILABLE = False
+from unidecode import unidecode

from invenio.refextract_config import \
    CFG_REFEXTRACT_MARKER_CLOSING_AUTHOR_ETAL, \
    CFG_REFEXTRACT_MARKER_CLOSING_AUTHOR_INCL, \
    CFG_REFEXTRACT_MARKER_CLOSING_AUTHOR_STND, \
    CFG_REFEXTRACT_MARKER_CLOSING_TITLE_IBID, \
    CFG_REFEXTRACT_MARKER_OPENING_TITLE_IBID, \
    CFG_REFEXTRACT_MARKER_OPENING_COLLABORATION, \
    CFG_REFEXTRACT_MARKER_CLOSING_COLLABORATION

from invenio.docextract_text import remove_and_record_multiple_spaces_in_line

from invenio.refextract_re import \
    re_ibid, \
    re_doi, \
    re_raw_url, \
    re_series_from_numeration, \
    re_punctuation, \
    re_correct_numeration_2nd_try_ptn1, \
    re_correct_numeration_2nd_try_ptn2, \
    re_correct_numeration_2nd_try_ptn3, \
    re_correct_numeration_2nd_try_ptn4, \
    re_numeration_nucphys_vol_page_yr, \
    re_numeration_vol_subvol_nucphys_yr_page, \
    re_numeration_nucphys_vol_yr_page, \
    re_multiple_hyphens, \
    re_numeration_vol_page_yr, \
    re_numeration_vol_yr_page, \
    re_numeration_vol_nucphys_series_yr_page, \
    re_numeration_vol_series_nucphys_page_yr, \
    re_numeration_vol_nucphys_series_page_yr, \
    re_html_tagged_url, \
    re_numeration_yr_vol_page, \
    re_numeration_vol_nucphys_page_yr, \
    re_wash_volume_tag, \
    re_numeration_vol_nucphys_yr_subvol_page, \
    re_quoted, \
    re_isbn, \
    re_arxiv, \
    re_arxiv_5digits, \
    re_new_arxiv, \
    re_new_arxiv_5digits, \
    re_pos, \
    re_pos_year_num, \
    re_series_from_numeration_after_volume, \
    RE_OLD_ARXIV, \
    RE_ARXIV_CATCHUP, \
    RE_ATLAS_CONF_PRE_2010, \
    RE_ATLAS_CONF_POST_2010

from invenio.authorextract_re import (get_author_regexps,
                                      etal_matches,
                                      re_ed_notation,
                                      re_etal)
from invenio.docextract_text import wash_line


def tag_reference_line(line, kbs, record_titles_count):
    # take a copy of the line as a first working line, clean it of bad
    # accents, and correct punctuation, etc:
    working_line1 = wash_line(line)

    # Identify volume for POS journal
    working_line1 = tag_pos_volume(working_line1)

    # Clean the line once more:
    working_line1 = wash_line(working_line1)

    # We identify quoted text
    # This is useful for books matching
    # This is also used by the author tagger to remove quoted
    # text which is a sign of a title and not an author
    working_line1 = tag_quoted_text(working_line1)

    # Identify ISBN (for books)
    working_line1 = tag_isbn(working_line1)

    # Identify arxiv reports
    working_line1 = tag_arxiv(working_line1)
    working_line1 = tag_arxiv_more(working_line1)

    # Identify volume for POS journal
    # needs special handling because the volume contains the year
    working_line1 = tag_pos_volume(working_line1)

    # Identify ATL-CONF and ATLAS-CONF report numbers
    # needs special handling because it has 2 formats depending on
    # the year and a 2-digit year format to convert
    working_line1 = tag_atlas_conf(working_line1)

    # Identify journals with regular expression
    # Some journals need to match exact regexps because they can
    # conflict with other elements
    # e.g.
# DAN is also a common first name
    standardised_titles = kbs['journals'][1]
    standardised_titles.update(kbs['journals_re'])
    journals_matches = identifiy_journals_re(working_line1, kbs['journals_re'])

    # Remove identified tags
    working_line2 = strip_tags(working_line1)
    # Transform the line to upper-case, now making a new working line:
    working_line2 = working_line2.upper()
    # Strip punctuation from the line:
    working_line2 = re_punctuation.sub(u' ', working_line2)
    # Remove multiple spaces from the line, recording
    # information about their coordinates:
    removed_spaces, working_line2 = \
        remove_and_record_multiple_spaces_in_line(working_line2)

    # Identify and record coordinates of institute preprint report numbers:
    found_pprint_repnum_matchlens, found_pprint_repnum_replstr, working_line2 = \
        identify_report_numbers(working_line2, kbs['report-numbers'])

    # Identify and record coordinates of non-standard journal titles:
    journals_matches_more, working_line2, line_titles_count = \
        identify_journals(working_line2, kbs['journals'])
    journals_matches.update(journals_matches_more)

    # Add the count of 'bad titles' found in this line to the total
    # for the reference section:
    record_titles_count = sum_2_dictionaries(record_titles_count,
                                             line_titles_count)

    # Attempt to identify, record and replace any IBIDs in the line:
    if working_line2.upper().find(u"IBID") != -1:
        # there is at least one IBID in the line - try to
        # identify its meaning:
        found_ibids_matchtext, working_line2 = identify_ibids(working_line2)
        # now update the dictionary of matched title lengths with the
        # matched IBID(s) lengths information:
        journals_matches.update(found_ibids_matchtext)

    publishers_matches = identify_publishers(working_line2, kbs['publishers'])

    tagged_line = process_reference_line(
        working_line=working_line1,
        journals_matches=journals_matches,
        pprint_repnum_len=found_pprint_repnum_matchlens,
        pprint_repnum_matchtext=found_pprint_repnum_replstr,
        publishers_matches=publishers_matches,
        removed_spaces=removed_spaces,
        standardised_titles=standardised_titles,
        kbs=kbs,
    )

    return tagged_line, record_titles_count


def process_reference_line(working_line, journals_matches, pprint_repnum_len,
                           pprint_repnum_matchtext, publishers_matches,
                           removed_spaces, standardised_titles, kbs):
    """After the phase of identifying and tagging citation instances in a
       reference line, this function is called to go through the line and
       the collected information about the recognised citations, and to
       transform the line into a string in which the recognised citations
       are wrapped in tags, depending upon their type.
       @param working_line: (string) - this is the line before the
        punctuation was stripped. At this stage, it has not been
        capitalised, and neither TITLEs nor REPORT-NUMBERs have been
        stripped from it. However, any recognised numeration and/or URLs
        have been tagged. The working_line could, for example, look
        something like this:
         [1] CDS http //invenio-software.org/.
       @param journals_matches: (dictionary) - the text of each periodical
        TITLE citation (including IBIDs) recognised in the line. Keyed by
        the index within the line of each match.
       @param pprint_repnum_len: (dictionary) - the lengths of the matched
        institutional preprint report number citations found within the
        line. Keyed by the index within the line of each match.
       @param pprint_repnum_matchtext: (dictionary) - the matched text for
        each matched institutional report number. Keyed by the index within
        the line of each match.
       @param publishers_matches: (dictionary) - the matched publisher
        names. Keyed by the index within the line of each match.
       @param removed_spaces: (dictionary) - the number of spaces removed
        from the various positions in the line. Keyed by the index of the
        position within the line at which the spaces were removed.
       @param standardised_titles: (dictionary) - the standardised journal
        titles, keyed by the non-standard version of those titles.
       @param kbs: (dictionary) - the loaded knowledge bases.
       @return: (string) - the rebuilt reference line, in which the
        recognised citations have been tagged.
    """
    if len(journals_matches) + len(pprint_repnum_len) + len(publishers_matches) == 0:
        # no TITLE or REPORT-NUMBER citations were found within this line,
        # use the raw line: (This 'raw' line could still be tagged with
        # recognised URLs or numeration.)
        tagged_line = working_line
    else:
        # TITLE and/or REPORT-NUMBER citations were found in this line,
        # build a new version of the working-line in which the standard
        # versions of the REPORT-NUMBERs and TITLEs are tagged:
        startpos = 0          # First cell of the reference line...
        previous_match = {}   # previously matched TITLE within line (used
                              # for replacement of IBIDs)
        replacement_types = {}
        journals_keys = journals_matches.keys()
        journals_keys.sort()
        reports_keys = pprint_repnum_matchtext.keys()
        reports_keys.sort()
        publishers_keys = publishers_matches.keys()
        publishers_keys.sort()
        spaces_keys = removed_spaces.keys()
        spaces_keys.sort()
        replacement_types = get_replacement_types(journals_keys,
                                                  reports_keys,
                                                  publishers_keys)
        replacement_locations = replacement_types.keys()
        replacement_locations.sort()

        tagged_line = u""  # This is to be the new 'working-line'. It will
                           # contain the tagged TITLEs and REPORT-NUMBERs,
                           # as well as any previously tagged URLs and
                           # numeration components.
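        # As an illustrative sketch only (the values below are hypothetical
        # and depend entirely on the loaded knowledge bases), a working line
        # such as:
        #   [1] J. Smith, Nucl. Phys. B75 (1974) 461
        # would typically be rebuilt along the lines of:
        #   [1] J. Smith, <cds.JOURNAL>Nucl. Phys.</cds.JOURNAL> <cds.VOL>B75</cds.VOL> <cds.YR>(1974)</cds.YR> <cds.PG>461</cds.PG>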
        # begin:
        for replacement_index in replacement_locations:
            # first, factor in any stripped spaces before this 'replacement'
            true_replacement_index, extras = \
                account_for_stripped_whitespace(spaces_keys,
                                                removed_spaces,
                                                replacement_types,
                                                pprint_repnum_len,
                                                journals_matches,
                                                replacement_index)

            if replacement_types[replacement_index] == u"journal":
                # Add a tagged periodical TITLE into the line:
                rebuilt_chunk, startpos, previous_match = \
                    add_tagged_journal(
                        reading_line=working_line,
                        journal_info=journals_matches[replacement_index],
                        previous_match=previous_match,
                        startpos=startpos,
                        true_replacement_index=true_replacement_index,
                        extras=extras,
                        standardised_titles=standardised_titles)
                tagged_line += rebuilt_chunk

            elif replacement_types[replacement_index] == u"reportnumber":
                # Add a tagged institutional preprint REPORT-NUMBER
                # into the line:
                rebuilt_chunk, startpos = \
                    add_tagged_report_number(
                        reading_line=working_line,
                        len_reportnum=pprint_repnum_len[replacement_index],
                        reportnum=pprint_repnum_matchtext[replacement_index],
                        startpos=startpos,
                        true_replacement_index=true_replacement_index,
                        extras=extras)
                tagged_line += rebuilt_chunk

            elif replacement_types[replacement_index] == u"publisher":
                rebuilt_chunk, startpos = \
                    add_tagged_publisher(
                        reading_line=working_line,
                        matched_publisher=publishers_matches[replacement_index],
                        startpos=startpos,
                        true_replacement_index=true_replacement_index,
                        extras=extras,
                        kb_publishers=kbs['publishers'])
                tagged_line += rebuilt_chunk

        # add the remainder of the original working-line into the rebuilt line:
        tagged_line += working_line[startpos:]

        # we have all the numeration
        # we can make sure there's no space between the volume
        # letter and the volume number
        # e.g. B 20 -> B20
        tagged_line = wash_volume_tag(tagged_line)

    # Try to find any authors in the line
    tagged_line = identify_and_tag_authors(tagged_line, kbs['authors'])
    # Try to find any collaboration in the line
    tagged_line = identify_and_tag_collaborations(tagged_line,
                                                  kbs['collaborations'])

    return tagged_line.replace('\n', '')


def wash_volume_tag(line):
    return re_wash_volume_tag[0].sub(re_wash_volume_tag[1], line)


def tag_isbn(line):
    """Tag books ISBN"""
    return re_isbn.sub(ur'<cds.ISBN>\g<code></cds.ISBN>', line)


def tag_quoted_text(line):
    """Tag quoted titles

    We use titles for pretty display of references that we could not
    associate with a record. We also use titles for recognising books.
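    Illustrative example (hypothetical input; it assumes re_quoted captures
    the quoted text in a group named 'title', as used in the substitution
    below):
        'See "Gravitation", Freeman'
        -> 'See <cds.QUOTED>Gravitation</cds.QUOTED>, Freeman'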
""" return re_quoted.sub(ur'\g</cds.QUOTED>', line) def tag_arxiv(line): """Tag arxiv report numbers We handle arXiv in 2 ways: * starting with arXiv:1022.1111 * this format exactly 9999.9999 We also format the output to the standard arxiv notation: * arXiv:2007.12.1111 * arXiv:2007.12.1111v2 """ def tagger(match): groups = match.groupdict() if match.group('suffix'): groups['suffix'] = ' ' + groups['suffix'] else: groups['suffix'] = '' return u'<cds.REPORTNUMBER>arXiv:%(year)s'\ u'%(month)s.%(num)s%(suffix)s' \ u'</cds.REPORTNUMBER>' % groups line = re_arxiv_5digits.sub(tagger, line) line = re_arxiv.sub(tagger, line) line = re_new_arxiv_5digits.sub(tagger, line) line = re_new_arxiv.sub(tagger, line) return line def tag_arxiv_more(line): """Tag old arxiv report numbers Either formats: * hep-th/1234567 * arXiv:1022111 [hep-ph] which transforms to hep-ph/1022111 """ line = RE_ARXIV_CATCHUP.sub(ur"\g<suffix>/\g<year>\g<month>\g<num>", line) for report_re, report_repl in RE_OLD_ARXIV: report_number = report_repl + ur"/\g<num>" line = report_re.sub(u'<cds.REPORTNUMBER>' + report_number + u'</cds.REPORTNUMBER>', line) return line def tag_pos_volume(line): """Tag POS volume number POS is journal that has special volume numbers e.g. PoS LAT2007 (2007) 369 """ def tagger(match): groups = match.groupdict() try: year = match.group('year') except IndexError: # Extract year from volume name # which should always include the year g = re.search(re_pos_year_num, match.group('volume_num'), re.UNICODE) year = g.group(0) if year: groups['year'] = ' <cds.YR>(%s)</cds.YR>' % year.strip().strip('()') else: groups['year'] = '' return '<cds.JOURNAL>PoS</cds.JOURNAL>' \ ' <cds.VOL>%(volume_name)s%(volume_num)s</cds.VOL>' \ '%(year)s' \ ' <cds.PG>%(page)s</cds.PG>' % groups for p in re_pos: line = p.sub(tagger, line) return line def tag_atlas_conf(line): line = RE_ATLAS_CONF_PRE_2010.sub( ur'<cds.REPORTNUMBER>ATL-CONF-\g<code></cds.REPORTNUMBER>', line) line = RE_ATLAS_CONF_POST_2010.sub( ur'<cds.REPORTNUMBER>ATLAS-CONF-\g<code></cds.REPORTNUMBER>', line) return line def identifiy_journals_re(line, kb_journals): matches = {} for pattern, dummy_journal in kb_journals: match = re.search(pattern, line) if match: matches[match.start()] = match.group(0) return matches def find_numeration_more(line): """Look for other numeration in line.""" # First, attempt to use marked-up titles patterns = ( re_correct_numeration_2nd_try_ptn1, re_correct_numeration_2nd_try_ptn2, re_correct_numeration_2nd_try_ptn3, re_correct_numeration_2nd_try_ptn4, ) for pattern in patterns: match = pattern.search(line) if match: info = match.groupdict() series = extract_series_from_volume(info['vol']) if not info['vol_num']: info['vol_num'] = info['vol_num_alt'] if not info['vol_num']: info['vol_num'] = info['vol_num_alt2'] return {'year': info.get('year', None), 'series': series, 'volume': info['vol_num'], 'page': info['page'], 'page_end': info['page_end'], 'len': len(info['aftertitle'])} return None def add_tagged_report_number(reading_line, len_reportnum, reportnum, startpos, true_replacement_index, extras): """In rebuilding the line, add an identified institutional REPORT-NUMBER (standardised and tagged) into the line. @param reading_line: (string) The reference line before capitalization was performed, and before REPORT-NUMBERs and TITLEs were stipped out. @param len_reportnum: (integer) the length of the matched REPORT-NUMBER. @param reportnum: (string) the replacement text for the matched REPORT-NUMBER. 
@param startpos: (integer) the pointer to the next position in the reading-line from which to start rebuilding. @param true_replacement_index: (integer) the replacement index of the matched REPORT-NUMBER in the reading-line, with stripped punctuation and whitespace accounted for. @param extras: (integer) extras to be added into the replacement index. @return: (tuple) containing a string (the rebuilt line segment) and an integer (the next 'startpos' in the reading-line). """ rebuilt_line = u"" # The segment of the line that's being rebuilt to # include the tagged & standardised REPORT-NUMBER # Fill rebuilt_line with the contents of the reading_line up to the point # of the institutional REPORT-NUMBER. However, stop 1 character before the # replacement index of this REPORT-NUMBER to allow for removal of braces, # if necessary: if (true_replacement_index - startpos - 1) >= 0: rebuilt_line += reading_line[startpos:true_replacement_index - 1] else: rebuilt_line += reading_line[startpos:true_replacement_index] # Add the tagged REPORT-NUMBER into the rebuilt-line segment: rebuilt_line += u"<cds.REPORTNUMBER>%(reportnum)s</cds.REPORTNUMBER>" \ % {'reportnum' : reportnum} # Move the pointer in the reading-line past the current match: startpos = true_replacement_index + len_reportnum + extras # Move past closing brace for report number (if there was one): try: if reading_line[startpos] in (u"]", u")"): startpos += 1 except IndexError: # moved past end of line - ignore pass # return the rebuilt-line segment and the pointer to the next position in # the reading-line from which to start rebuilding up to the next match: return rebuilt_line, startpos def add_tagged_journal_in_place_of_IBID(previous_match): """In rebuilding the line, if the matched TITLE was actually an IBID, this function will replace it with the previously matched TITLE, and add it into the line, tagged. It will even handle the series letter, if it differs. For example, if the previous match is "Nucl. Phys. B", and the ibid is "IBID A", the title inserted into the line will be "Nucl. Phys. A". Otherwise, if the IBID had no series letter, it will simply be replaced by "Nucl. Phys. B" (i.e. the previous match.) @param previous_match: (string) - the previously matched TITLE. @param ibid_series: (string) - the series of the IBID (if any). @return: (tuple) containing a string (the rebuilt line segment) and an other string (the newly updated previous-match). """ return " %s%s%s" % (CFG_REFEXTRACT_MARKER_OPENING_TITLE_IBID, previous_match['title'], CFG_REFEXTRACT_MARKER_CLOSING_TITLE_IBID) def extract_series_from_volume(volume): patterns = (re_series_from_numeration, re_series_from_numeration_after_volume) for p in patterns: match = p.search(volume) if match: return match.group(1) return None def create_numeration_tag(info): if info['series']: series_and_volume = info['series'] + info['volume'] else: series_and_volume = info['volume'] numeration_tags = u' <cds.VOL>%s</cds.VOL>' % series_and_volume if info.get('year', False): numeration_tags += u' <cds.YR>(%(year)s)</cds.YR>' % info if info.get('page_end', False): numeration_tags += u' <cds.PG>%(page)s-%(page_end)s</cds.PG>' % info else: numeration_tags += u' <cds.PG>%(page)s</cds.PG>' % info return numeration_tags def add_tagged_journal(reading_line, journal_info, previous_match, startpos, true_replacement_index, extras, standardised_titles): """In rebuilding the line, add an identified periodical TITLE (standardised and tagged) into the line. 
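       For example (an illustrative sketch with hypothetical values): if the
       matched non-standard title were 'PHYS REV D' and the knowledge base
       standardised it to 'Phys. Rev. D', the segment written back into the
       rebuilt line would contain '<cds.JOURNAL>Phys. Rev. D</cds.JOURNAL>',
       followed by any numeration tags recognised after the title.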
@param reading_line: (string) The reference line before capitalization was performed, and before REPORT-NUMBERs and TITLEs were stripped out. @param len_title: (integer) the length of the matched TITLE. @param matched_title: (string) the matched TITLE text. @param previous_match: (dict) the previous periodical TITLE citation to have been matched in the current reference line. It is used when replacing an IBID instance in the line. @param startpos: (integer) the pointer to the next position in the reading-line from which to start rebuilding. @param true_replacement_index: (integer) the replacement index of the matched TITLE in the reading-line, with stripped punctuation and whitespace accounted for. @param extras: (integer) extras to be added into the replacement index. @param standardised_titles: (dictionary) the standardised versions of periodical titles, keyed by their various non-standard versions. @return: (tuple) containing a string (the rebuilt line segment), an integer (the next 'startpos' in the reading-line), and an other string (the newly updated previous-match). """ old_startpos = startpos old_previous_match = previous_match skip_numeration = False series = None def skip_ponctuation(line, pos): # Skip past any punctuation at the end of the replacement that was # just made: try: while line[pos] in (".", ":", "-", ")"): pos += 1 except IndexError: # The match was at the very end of the line pass return pos # Fill 'rebuilt_line' (the segment of the line that is being rebuilt to # include the tagged and standardised periodical TITLE) with the contents # of the reading-line, up to the point of the matched TITLE: rebuilt_line = reading_line[startpos:true_replacement_index] # Test to see whether a title or an "IBID" was matched: if journal_info.upper().find("IBID") != -1: # This is an IBID # Try to replace the IBID with a title: if previous_match: # Replace this IBID with the previous title match, if possible: rebuilt_line += add_tagged_journal_in_place_of_IBID(previous_match) series = previous_match['series'] # Update start position for next segment of original line: startpos = true_replacement_index + len(journal_info) + extras startpos = skip_ponctuation(reading_line, startpos) else: rebuilt_line = "" skip_numeration = True else: if ';' in standardised_titles[journal_info]: title, series = \ standardised_titles[journal_info].rsplit(';', 1) series = series.strip() previous_match = {'title': title, 'series': series} else: title = standardised_titles[journal_info] previous_match = {'title': title, 'series': None} # This is a normal title, not an IBID rebuilt_line += "<cds.JOURNAL>%s</cds.JOURNAL>" % title startpos = true_replacement_index + len(journal_info) + extras startpos = skip_ponctuation(reading_line, startpos) if not skip_numeration: # Check for numeration numeration_line = reading_line[startpos:] # First look for standard numeration numerotation_info = find_numeration(numeration_line) if not numerotation_info: numeration_line = rebuilt_line + " " + numeration_line # Now look for more funky numeration # With possibly some elements before the journal title numerotation_info = find_numeration_more(numeration_line) if not numerotation_info: startpos = old_startpos previous_match = old_previous_match rebuilt_line = "" else: if series and not numerotation_info['series']: numerotation_info['series'] = series startpos += numerotation_info['len'] rebuilt_line += create_numeration_tag(numerotation_info) previous_match['series'] = numerotation_info['series'] # return the rebuilt 
# line-segment, the position (of the reading line) from
    # which the next part of the rebuilt line should be started, and the
    # newly updated previous match.
    return rebuilt_line, startpos, previous_match


def add_tagged_publisher(reading_line,
                         matched_publisher,
                         startpos,
                         true_replacement_index,
                         extras,
                         kb_publishers):
    """In rebuilding the line, add an identified publisher name
       (standardised and tagged) into the line.
       @param reading_line: (string) The reference line before capitalization
        was performed, and before REPORT-NUMBERs and TITLEs were stripped
        out.
       @param matched_publisher: (string) the matched publisher text.
       @param startpos: (integer) the pointer to the next position in the
        reading-line from which to start rebuilding.
       @param true_replacement_index: (integer) the replacement index of the
        matched publisher in the reading-line, with stripped punctuation and
        whitespace accounted for.
       @param extras: (integer) extras to be added into the replacement
        index.
       @param kb_publishers: (dictionary) the publishers knowledge base,
        holding the standardised replacement for each recognised publisher.
       @return: (tuple) containing a string (the rebuilt line segment) and
        an integer (the next 'startpos' in the reading-line).
    """
    # Fill 'rebuilt_line' (the segment of the line that is being rebuilt to
    # include the tagged and standardised publisher name) with the contents
    # of the reading-line, up to the point of the matched publisher:
    rebuilt_line = reading_line[startpos:true_replacement_index]

    # Add the tagged, standardised publisher name into the rebuilt line:
    rebuilt_line += "<cds.PUBLISHER>%(title)s</cds.PUBLISHER>" \
        % {'title': kb_publishers[matched_publisher]['repl']}

    # Compute new start pos
    startpos = true_replacement_index + len(matched_publisher) + extras

    # return the rebuilt line-segment and the position (of the reading line)
    # from which the next part of the rebuilt line should be started.
    return rebuilt_line, startpos


def get_replacement_types(titles, reportnumbers, publishers):
    """Given the indices of the titles, reportnumbers and publishers that
       have been recognised within a reference line, create a dictionary
       keyed by the replacement position in the line, where the value for
       each key is a string describing the type of item replaced at that
       position in the line.
       The description strings are:
           'journal'      - indicating that the replacement is a periodical
                            title;
           'reportnumber' - indicating that the replacement is a preprint
                            report number;
           'publisher'    - indicating that the replacement is a publisher
                            name.
       @param titles: (list) of locations in the string at which periodical
        titles were found.
       @param reportnumbers: (list) of locations in the string at which
        reportnumbers were found.
       @param publishers: (list) of locations in the string at which
        publisher names were found.
       @return: (dictionary) of replacement types at various locations
        within the string.
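       For example (illustrative values only):
           get_replacement_types([6], [20], []) would return
           {6: "journal", 20: "reportnumber"}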
""" rep_types = {} for item_idx in titles: rep_types[item_idx] = "journal" for item_idx in reportnumbers: rep_types[item_idx] = "reportnumber" for item_idx in publishers: rep_types[item_idx] = "publisher" return rep_types def account_for_stripped_whitespace(spaces_keys, removed_spaces, replacement_types, len_reportnums, journals_matches, replacement_index): """To build a processed (MARC XML) reference line in which the recognised citations such as standardised periodical TITLEs and REPORT-NUMBERs have been marked up, it is necessary to read from the reference line BEFORE all punctuation was stripped and it was made into upper-case. The indices of the cited items in this 'original line', however, will be different to those in the 'working-line', in which punctuation and multiple-spaces were stripped out. For example, the following reading-line: [26] E. Witten and S.-T. Yau, hep-th/9910245. ...becomes (after punctuation and multiple white-space stripping): [26] E WITTEN AND S T YAU HEP TH/9910245 It can be seen that the report-number citation (hep-th/9910245) is at a different index in the two strings. When refextract searches for this citation, it uses the 2nd string (i.e. that which is capitalised and has no punctuation). When it builds the MARC XML representation of the reference line, however, it needs to read from the first string. It must therefore consider the whitespace, punctuation, etc that has been removed, in order to get the correct index for the cited item. This function accounts for the stripped characters before a given TITLE or REPORT-NUMBER index. @param spaces_keys: (list) - the indices at which spaces were removed from the reference line. @param removed_spaces: (dictionary) - keyed by the indices at which spaces were removed from the line, the values are the number of spaces actually removed from that position. So, for example, "3 spaces were removed from position 25 in the line." @param replacement_types: (dictionary) - at each 'replacement_index' in the line, the of replacement to make (title or reportnumber). @param len_reportnums: (dictionary) - the lengths of the REPORT- NUMBERs matched at the various indices in the line. @param len_titles: (dictionary) - the lengths of the various TITLEs matched at the various indices in the line. @param replacement_index: (integer) - the index in the working line of the identified TITLE or REPORT-NUMBER citation. @return: (tuple) containing 2 elements: + the true replacement index of a replacement in the reading line; + any extras to add into the replacement index; """ extras = 0 true_replacement_index = replacement_index spare_replacement_index = replacement_index for space in spaces_keys: if space < true_replacement_index: # There were spaces stripped before the current replacement # Add the number of spaces removed from this location to the # current replacement index: true_replacement_index += removed_spaces[space] spare_replacement_index += removed_spaces[space] elif space >= spare_replacement_index and \ replacement_types[replacement_index] == u"journal" and \ space < (spare_replacement_index + len(journals_matches[replacement_index])): # A periodical title is being replaced. 
Account for multi-spaces # that may have been stripped from the title before its # recognition: spare_replacement_index += removed_spaces[space] extras += removed_spaces[space] elif space >= spare_replacement_index and \ replacement_types[replacement_index] == u"reportnumber" and \ space < (spare_replacement_index + len_reportnums[replacement_index]): # An institutional preprint report-number is being replaced. # Account for multi-spaces that may have been stripped from it # before its recognition: spare_replacement_index += removed_spaces[space] extras += removed_spaces[space] # return the new values for replacement indices with stripped # whitespace accounted for: return true_replacement_index, extras def strip_tags(line): # Firstly, go through and change ALL TAGS and their contents to underscores # author content can be checked for underscores later on # Note that we don't have embedded tags this is why # we can do this re_tag = re.compile(ur'<cds\.[A-Z]+>[^<]*</cds\.[A-Z]+>|<cds\.[A-Z]+ />', re.UNICODE) for m in re_tag.finditer(line): chars_count = m.end() - m.start() line = re_tag.sub('_'*chars_count, line, count=1) return line def identify_and_tag_collaborations(line, collaborations_kb): """Given a line where Authors have been tagged, and all other tags and content has been replaced with underscores, go through and try to identify extra items of data which should be placed into 'h' subfields. Later on, these tagged pieces of information will be merged into the content of the most recently found author. This is separated from the author tagging procedure since separate tags can be used, which won't influence the reference splitting heuristics (used when looking at mulitple <AUTH> tags in a line). """ for dummy_collab, re_collab in collaborations_kb.iteritems(): matches = re_collab.finditer(strip_tags(line)) for match in reversed(list(matches)): line = line[:match.start()] \ + CFG_REFEXTRACT_MARKER_OPENING_COLLABORATION \ + match.group(1).strip(".,:;- [](){}") \ + CFG_REFEXTRACT_MARKER_CLOSING_COLLABORATION \ + line[match.end():] return line def identify_and_tag_authors(line, authors_kb): """Given a reference, look for a group of author names, place tags around the author group, return the newly tagged line. 
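    A rough illustrative sketch (hypothetical input; the exact behaviour
    depends on the compiled author regexps and the authors knowledge base):
        'J. Smith and A. Jones, Phys. Lett. B123 (1983) 45'
        -> '<cds.AUTHstnd>J. Smith and A. Jones</cds.AUTHstnd>, Phys. Lett. ...'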
""" re_auth, re_auth_near_miss = get_author_regexps() # Replace authors which do not convert well from utf-8 for pattern, repl in authors_kb: line = line.replace(pattern, repl) output_line = line # We matched authors here line = strip_tags(output_line) matched_authors = list(re_auth.finditer(line)) # We try to have better results by unidecoding - if UNIDECODE_AVAILABLE: - unidecoded_line = strip_tags(unidecode(output_line)) - matched_authors_unidecode = list(re_auth.finditer(unidecoded_line)) + unidecoded_line = strip_tags(unidecode(output_line)) + matched_authors_unidecode = list(re_auth.finditer(unidecoded_line)) - if len(matched_authors_unidecode) > len(matched_authors): - output_line = unidecode(output_line) - matched_authors = matched_authors_unidecode + if len(matched_authors_unidecode) > len(matched_authors): + output_line = unidecode(output_line) + matched_authors = matched_authors_unidecode # If there is at least one matched author group if matched_authors: matched_positions = [] preceeding_text_string = line preceeding_text_start = 0 for auth_no, match in enumerate(matched_authors): # Only if there are no underscores or closing arrows found in the matched author group # This must be checked for here, as it cannot be applied to the re without clashing with # other Unicode characters if line[match.start():match.end()].find("_") == -1: # Has the group with name 'et' (for 'et al') been found in the pattern? # Has the group with name 'es' (for ed. before the author) been found in the pattern? # Has the group with name 'ee' (for ed. after the author) been found in the pattern? matched_positions.append({ 'start' : match.start(), 'end' : match.end(), 'etal' : match.group('et') or match.group('et2'), 'ed_start' : match.group('es'), 'ed_end' : match.group('ee'), 'multi_auth' : match.group('multi_auth'), 'multi_surs' : match.group('multi_surs'), 'text_before' : preceeding_text_string[preceeding_text_start:match.start()], 'auth_no' : auth_no, 'author_names': match.group('author_names') }) # Save the end of the match, from where to snip the misc text found before an author match preceeding_text_start = match.end() # Work backwards to avoid index problems when adding AUTH tags matched_positions.reverse() for m in matched_positions: dump_in_misc = False start = m['start'] end = m['end'] # Check the text before the current match to see if it has a bad 'et al' lower_text_before = m['text_before'].strip().lower() for e in etal_matches: if lower_text_before.endswith(e): ## If so, this author match is likely to be a bad match on a missed title dump_in_misc = True break # An AND found here likely indicates a missed author before this text # Thus, triggers weaker author searching, within the previous misc text # (Check the text before the current match to see if it has a bad 'and') # A bad 'and' will only be denoted as such if there exists only one author after it # and the author group is legit (not to be dumped in misc) if not dump_in_misc and not (m['multi_auth'] or m['multi_surs']) \ and (lower_text_before.endswith(' and')): # Search using a weaker author pattern to try and find the missed author(s) (cut away the end 'and') weaker_match = re_auth_near_miss.match(m['text_before']) if weaker_match and not (weaker_match.group('es') or weaker_match.group('ee')): # Change the start of the author group to include this new author group start = start - (len(m['text_before']) - weaker_match.start()) # Still no match, do not add tags for this author match.. 
# dump it into misc
                else:
                    dump_in_misc = True

            add_to_misc = ""
            # If a semi-colon was found at the end of this author group,
            # keep it in misc so that it can be looked at for splitting
            # heuristics
            if len(output_line) > m['end']:
                if output_line[m['end']].strip(" ,.") == ';':
                    add_to_misc = ';'

            # Standardize eds. notation
            tmp_output_line = re_ed_notation.sub('(ed.)',
                                                 output_line[start:end])
            # Standardize et al. notation
            tmp_output_line = re_etal.sub('et al.', tmp_output_line)
            # Strip
            tmp_output_line = tmp_output_line.lstrip('.').strip(",:;- [](")
            if not tmp_output_line.endswith('(ed.)'):
                tmp_output_line = tmp_output_line.strip(')')

            # ONLY wrap author data with tags IF there is no evidence that
            # it is an ed. author. (i.e. The author is not referred to as
            # an editor)

            # Does this author group string have 'et al.'?
            if m['etal'] and not (m['ed_start'] or m['ed_end'] or dump_in_misc):
                output_line = output_line[:start] \
                    + "<cds.AUTHetal>" \
                    + tmp_output_line \
                    + CFG_REFEXTRACT_MARKER_CLOSING_AUTHOR_ETAL \
                    + add_to_misc \
                    + output_line[end:]
            elif not (m['ed_start'] or m['ed_end'] or dump_in_misc):
                # Insert the std (standard) tag
                output_line = output_line[:start] \
                    + "<cds.AUTHstnd>" \
                    + tmp_output_line \
                    + CFG_REFEXTRACT_MARKER_CLOSING_AUTHOR_STND \
                    + add_to_misc \
                    + output_line[end:]
            # Apply the 'include in $h' method to author groups marked as
            # editors
            elif m['ed_start'] or m['ed_end']:
                ed_notation = " (eds.)"
                # Standardize et al. notation
                tmp_output_line = re_etal.sub('et al.', m['author_names'])
                # remove any characters which denote this author group
                # to be editors, just take the
                # author names, and append '(ed.)'
                output_line = output_line[:start] \
                    + "<cds.AUTHincl>" \
                    + tmp_output_line.strip(",:;- [](") \
                    + ed_notation \
                    + CFG_REFEXTRACT_MARKER_CLOSING_AUTHOR_INCL \
                    + add_to_misc \
                    + output_line[end:]

    return output_line


def sum_2_dictionaries(dicta, dictb):
    """Given two dictionaries of totals, where each total refers to a key
       in the dictionary, add the totals.
       E.g.:  dicta = { 'a' : 3, 'b' : 1 }
              dictb = { 'a' : 1, 'c' : 5 }
              dicta + dictb = { 'a' : 4, 'b' : 1, 'c' : 5 }
       @param dicta: (dictionary)
       @param dictb: (dictionary)
       @return: (dictionary) - the sum of the 2 dictionaries
    """
    dict_out = dicta.copy()
    for key in dictb.keys():
        if key in dict_out:
            # Add the sum for key in dictb to that of dict_out:
            dict_out[key] += dictb[key]
        else:
            # the key is not in the first dictionary - add it directly:
            dict_out[key] = dictb[key]
    return dict_out


def identify_ibids(line):
    """Find IBIDs within the line, record their position and length,
       and replace them with underscores.
       @param line: (string) the working reference line
       @return: (tuple) containing a dictionary and a string:
         Dictionary: matched IBID text: (Key: position of IBID in
                     line; Value: matched IBID text)
         String:     working line with matched IBIDs removed
    """
    ibid_match_txt = {}
    # Record details of each matched ibid:
    for m_ibid in re_ibid.finditer(line):
        ibid_match_txt[m_ibid.start()] = m_ibid.group(0)
        # Replace matched text in line with underscores:
        line = line[0:m_ibid.start()] + \
            "_" * len(m_ibid.group(0)) + \
            line[m_ibid.end():]

    return ibid_match_txt, line


def find_all(string, sub):
    listindex = []
    offset = 0
    i = string.find(sub, offset)
    while i >= 0:
        listindex.append(i)
        i = string.find(sub, i + 1)
    return listindex


def find_numeration(line):
    """Given a reference line, attempt to locate instances of citation
       'numeration' in the line.
       @param line: (string) the reference line.
@return: (string) the reference line after numeration has been checked and possibly recognized/marked-up. """ patterns = ( # vol,page,year re_numeration_vol_page_yr, re_numeration_vol_nucphys_page_yr, re_numeration_nucphys_vol_page_yr, # With sub volume re_numeration_vol_subvol_nucphys_yr_page, re_numeration_vol_nucphys_yr_subvol_page, # vol,year,page re_numeration_vol_yr_page, re_numeration_nucphys_vol_yr_page, re_numeration_vol_nucphys_series_yr_page, # vol,page,year re_numeration_vol_series_nucphys_page_yr, re_numeration_vol_nucphys_series_page_yr, # year,vol,page re_numeration_yr_vol_page, ) for pattern in patterns: match = pattern.match(line) if match: info = match.groupdict() series = info.get('series', None) if not series: series = extract_series_from_volume(info['vol']) if not info['vol_num']: info['vol_num'] = info['vol_num_alt'] if not info['vol_num']: info['vol_num'] = info['vol_num_alt2'] return {'year': info.get('year', None), 'series': series, 'volume': info['vol_num'], 'page': info['page'], 'page_end': info['page_end'], 'len': match.end()} return None def identify_journals(line, kb_journals): """Attempt to identify all periodical titles in a reference line. Titles will be identified, their information (location in line, length in line, and non-standardised version) will be recorded, and they will be replaced in the working line by underscores. @param line: (string) - the working reference line. @param periodical_title_search_kb: (dictionary) - contains the regexp patterns used to search for a non-standard TITLE in the working reference line. Keyed by the TITLE string itself. @param periodical_title_search_keys: (list) - contains the non- standard periodical TITLEs to be searched for in the line. This list of titles has already been ordered and is used to force the order of searching. @return: (tuple) containing 4 elements: + (dictionary) - the lengths of all titles matched at each given index within the line. + (dictionary) - the text actually matched for each title at each given index within the line. + (string) - the working line, with the titles removed from it and replaced by underscores. + (dictionary) - the totals for each bad-title found in the line. """ periodical_title_search_kb = kb_journals[0] periodical_title_search_keys = kb_journals[2] title_matches = {} # the text matched at the given line # location (i.e. the title itself) titles_count = {} # sum totals of each 'bad title found in # line. # Begin searching: for title in periodical_title_search_keys: # search for all instances of the current periodical title # in the line: # for each matched periodical title: for title_match in periodical_title_search_kb[title].finditer(line): if title not in titles_count: # Add this title into the titles_count dictionary: titles_count[title] = 1 else: # Add 1 to the count for the given title: titles_count[title] += 1 # record the details of this title match: # record the match length: title_matches[title_match.start()] = title len_to_replace = len(title) # replace the matched title text in the line it n * '_', # where n is the length of the matched title: line = u"".join((line[:title_match.start()], u"_" * len_to_replace, line[title_match.start() + len_to_replace:])) # return recorded information about matched periodical titles, # along with the newly changed working line: return title_matches, line, titles_count def identify_report_numbers(line, kb_reports): """Attempt to identify all preprint report numbers in a reference line. 
Report numbers will be identified, their information (location in line, length in line, and standardised replacement version) will be recorded, and they will be replaced in the working-line by underscores. @param line: (string) - the working reference line. @param preprint_repnum_search_kb: (dictionary) - contains the regexp patterns used to identify preprint report numbers. @param preprint_repnum_standardised_categs: (dictionary) - contains the standardised 'category' of a given preprint report number. @return: (tuple) - 3 elements: * a dictionary containing the lengths in the line of the matched preprint report numbers, keyed by the index at which each match was found in the line. * a dictionary containing the replacement strings (standardised versions) of preprint report numbers that were matched in the line. * a string, that is the new version of the working reference line, in which any matched preprint report numbers have been replaced by underscores. Returned tuple is therefore in the following order: (matched-reportnum-lengths, matched-reportnum-replacements, working-line) """ def _by_len(a, b): """Comparison function used to sort a list by the length of the strings in each element of the list. """ if len(a[1]) < len(b[1]): return 1 elif len(a[1]) == len(b[1]): return 0 else: return -1 repnum_matches_matchlen = {} # info about lengths of report numbers # matched at given locations in line repnum_matches_repl_str = {} # standardised report numbers matched # at given locations in line repnum_search_kb, repnum_standardised_categs = kb_reports repnum_categs = repnum_standardised_categs.keys() repnum_categs.sort(_by_len) # Handle CERN/LHCC/98-013 line = line.replace('/', ' ') # try to match preprint report numbers in the line: for categ in repnum_categs: # search for all instances of the current report # numbering style in the line: repnum_matches_iter = repnum_search_kb[categ].finditer(line) # for each matched report number of this style: for repnum_match in repnum_matches_iter: # Get the matched text for the numeration part of the # preprint report number: numeration_match = repnum_match.group('numn') # clean/standardise this numeration text: numeration_match = numeration_match.replace(" ", "-") numeration_match = re_multiple_hyphens.sub("-", numeration_match) numeration_match = numeration_match.replace("/-", "/") numeration_match = numeration_match.replace("-/", "/") numeration_match = numeration_match.replace("-/-", "/") # replace the found preprint report number in the # string with underscores # (this will replace chars in the lower-cased line): line = line[0:repnum_match.start(1)] \ + "_"*len(repnum_match.group(1)) + line[repnum_match.end(1):] # record the information about the matched preprint report number: # total length in the line of the matched preprint report number: repnum_matches_matchlen[repnum_match.start(1)] = \ len(repnum_match.group(1)) # standardised replacement for the matched preprint report number: repnum_matches_repl_str[repnum_match.start(1)] = \ repnum_standardised_categs[categ] \ + numeration_match # return recorded information about matched report numbers, along with # the newly changed working line: return repnum_matches_matchlen, repnum_matches_repl_str, line def identify_publishers(line, kb_publishers): matches_repl = {} # standardised report numbers matched # at given locations in line for abbrev, info in kb_publishers.iteritems(): for match in info['pattern'].finditer(line): # record the matched non-standard version of the publisher: 
matches_repl[match.start(0)] = abbrev return matches_repl def identify_and_tag_URLs(line): """Given a reference line, identify URLs in the line, record the information about them, and replace them with a "<cds.URL />" tag. URLs are identified in 2 forms: + Raw: http://invenio-software.org/ + HTML marked-up: <a href="http://invenio-software.org/">CERN Document Server Software Consortium</a> These URLs are considered to have 2 components: The URL itself (url string); and the URL description. The description is effectively the text used for the created Hyperlink when the URL is marked-up in HTML. When an HTML marked-up URL has been recognised, the text between the anchor tags is therefore taken as the URL description. In the case of a raw URL recognition, however, the URL itself will also be used as the URL description. For example, in the following reference line: [1] See <a href="http://invenio-software.org/">CERN Document Server Software Consortium</a>. ...the URL string will be "http://invenio-software.org/" and the URL description will be "CERN Document Server Software Consortium". The line returned from this function will be: [1] See <cds.URL /> In the following line, however: [1] See http //invenio-software.org/ for more details. ...the URL string will be "http://invenio-software.org/" and the URL description will also be "http://invenio-software.org/". The line returned will be: [1] See <cds.URL /> for more details. @param line: (string) the reference line in which to search for URLs. @return: (tuple) - containing 2 items: + the line after URLs have been recognised and removed; + a list of 2-item tuples where each tuple represents a recognised URL and its description: [(url, url-description), (url, url-description), ... ] @Exceptions raised: + an IndexError if there is a problem with the number of URLs recognised (this should not happen.) """ # Take a copy of the line: line_pre_url_check = line # Dictionaries to record details of matched URLs: found_url_full_matchlen = {} found_url_urlstring = {} found_url_urldescr = {} # List to contain details of all matched URLs: identified_urls = [] # Attempt to identify and tag all HTML-MARKED-UP URLs in the line: m_tagged_url_iter = re_html_tagged_url.finditer(line) for m_tagged_url in m_tagged_url_iter: startposn = m_tagged_url.start() # start position of matched URL endposn = m_tagged_url.end() # end position of matched URL matchlen = len(m_tagged_url.group(0)) # total length of URL match found_url_full_matchlen[startposn] = matchlen found_url_urlstring[startposn] = m_tagged_url.group('url') found_url_urldescr[startposn] = m_tagged_url.group('desc') # temporarily replace the URL match with underscores so that # it won't be re-found line = line[0:startposn] + u"_"*matchlen + line[endposn:] # Attempt to identify and tag all RAW (i.e. 
not # HTML-marked-up) URLs in the line: m_raw_url_iter = re_raw_url.finditer(line) for m_raw_url in m_raw_url_iter: startposn = m_raw_url.start() # start position of matched URL endposn = m_raw_url.end() # end position of matched URL matchlen = len(m_raw_url.group(0)) # total length of URL match matched_url = m_raw_url.group('url') if len(matched_url) > 0 and matched_url[-1] in (".", ","): # Strip the full-stop or comma from the end of the url: matched_url = matched_url[:-1] found_url_full_matchlen[startposn] = matchlen found_url_urlstring[startposn] = matched_url found_url_urldescr[startposn] = matched_url # temporarily replace the URL match with underscores # so that it won't be re-found line = line[0:startposn] + u"_"*matchlen + line[endposn:] # Now that all URLs have been identified, insert them # back into the line, tagged: found_url_positions = found_url_urlstring.keys() found_url_positions.sort() found_url_positions.reverse() for url_position in found_url_positions: line = line[0:url_position] + "<cds.URL />" \ + line[url_position + found_url_full_matchlen[url_position]:] # The line has been rebuilt. Now record the information about the # matched URLs: found_url_positions = found_url_urlstring.keys() found_url_positions.sort() for url_position in found_url_positions: identified_urls.append((found_url_urlstring[url_position], found_url_urldescr[url_position])) # Somehow the number of URLs found doesn't match the number of # URLs recorded in "identified_urls". Raise an IndexError. msg = """Error: The number of URLs found in the reference line """ \ """does not match the number of URLs recorded in the """ \ """list of identified URLs!\nLine pre-URL checking: %s\n""" \ """Line post-URL checking: %s\n""" \ % (line_pre_url_check, line) assert len(identified_urls) == len(found_url_positions), msg # return the line containing the tagged URLs: return line, identified_urls def identify_and_tag_DOI(line): """takes a single citation line and attempts to locate any DOI references. DOI references are recognised in both http (url) format and also the standard DOI notation (DOI: ...) @param line: (string) the reference line in which to search for DOI's. @return: the tagged line and a list of DOI strings (if any) """ # Used to hold the DOI strings in the citation line doi_strings = [] # Run the DOI pattern on the line, returning the re.match objects matched_doi = re_doi.finditer(line) # For each match found in the line for match in reversed(list(matched_doi)): # Store the start and end position start = match.start() end = match.end() # Get the actual DOI string (remove the url part of the doi string) doi_phrase = match.group(6) # Replace the entire matched doi with a tag line = line[0:start] + "<cds.DOI />" + line[end:] # Add the single DOI string to the list of DOI strings doi_strings.append(doi_phrase) doi_strings.reverse() return line, doi_strings diff --git a/modules/miscutil/lib/textutils.py b/modules/miscutil/lib/textutils.py index 792251862..fa0156e27 100644 --- a/modules/miscutil/lib/textutils.py +++ b/modules/miscutil/lib/textutils.py @@ -1,775 +1,766 @@ # -*- coding: utf-8 -*- ## This file is part of Invenio. ## Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2013 CERN. ## ## Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. 
## ## Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. """ Functions useful for text wrapping (in a box) and indenting. """ __revision__ = "$Id$" import sys import re import textwrap import htmlentitydefs import invenio.template from invenio.config import CFG_ETCDIR try: import chardet CHARDET_AVAILABLE = True except ImportError: CHARDET_AVAILABLE = False -try: - from unidecode import unidecode - UNIDECODE_AVAILABLE = True -except ImportError: - UNIDECODE_AVAILABLE = False +from unidecode import unidecode CFG_LATEX_UNICODE_TRANSLATION_CONST = {} CFG_WRAP_TEXT_IN_A_BOX_STYLES = { '__DEFAULT' : { 'horiz_sep' : '*', 'max_col' : 72, 'min_col' : 40, 'tab_str' : ' ', 'tab_num' : 0, 'border' : ('**', '*', '**', '** ', ' **', '**', '*', '**'), 'prefix' : '\n', 'suffix' : '\n', 'break_long' : False, 'force_horiz' : False, }, 'squared' : { 'horiz_sep' : '-', 'border' : ('+', '-', '+', '| ', ' |', '+', '-', '+') }, 'double_sharp' : { 'horiz_sep' : '#', 'border' : ('##', '#', '##', '## ', ' ##', '##', '#', '##') }, 'single_sharp' : { 'horiz_sep' : '#', 'border' : ('#', '#', '#', '# ', ' #', '#', '#', '#') }, 'single_star' : { 'border' : ('*', '*', '*', '* ', ' *', '*', '*', '*',) }, 'double_star' : { }, 'no_border' : { 'horiz_sep' : '', 'border' : ('', '', '', '', '', '', '', ''), 'prefix' : '', 'suffix' : '' }, 'conclusion' : { 'border' : ('', '', '', '', '', '', '', ''), 'prefix' : '', 'horiz_sep' : '-', 'force_horiz' : True, }, 'important' : { 'tab_num' : 1, }, 'ascii' : { 'horiz_sep' : (u'├', u'─', u'┤'), 'border' : (u'┌', u'─', u'┐', u'│ ', u' │', u'└', u'─', u'┘'), }, 'ascii_double' : { 'horiz_sep' : (u'╠', u'═', u'╣'), 'border' : (u'╔', u'═', u'╗', u'║ ', u' ║', u'╚', u'═', u'╝'), } } re_unicode_lowercase_a = re.compile(unicode(r"(?u)[áàäâãå]", "utf-8")) re_unicode_lowercase_ae = re.compile(unicode(r"(?u)[æ]", "utf-8")) re_unicode_lowercase_oe = re.compile(unicode(r"(?u)[œ]", "utf-8")) re_unicode_lowercase_e = re.compile(unicode(r"(?u)[éèëê]", "utf-8")) re_unicode_lowercase_i = re.compile(unicode(r"(?u)[íìïî]", "utf-8")) re_unicode_lowercase_o = re.compile(unicode(r"(?u)[óòöôõø]", "utf-8")) re_unicode_lowercase_u = re.compile(unicode(r"(?u)[úùüû]", "utf-8")) re_unicode_lowercase_y = re.compile(unicode(r"(?u)[ýÿ]", "utf-8")) re_unicode_lowercase_c = re.compile(unicode(r"(?u)[çć]", "utf-8")) re_unicode_lowercase_n = re.compile(unicode(r"(?u)[ñ]", "utf-8")) re_unicode_lowercase_ss = re.compile(unicode(r"(?u)[ß]", "utf-8")) re_unicode_uppercase_a = re.compile(unicode(r"(?u)[ÁÀÄÂÃÅ]", "utf-8")) re_unicode_uppercase_ae = re.compile(unicode(r"(?u)[Æ]", "utf-8")) re_unicode_uppercase_oe = re.compile(unicode(r"(?u)[Œ]", "utf-8")) re_unicode_uppercase_e = re.compile(unicode(r"(?u)[ÉÈËÊ]", "utf-8")) re_unicode_uppercase_i = re.compile(unicode(r"(?u)[ÍÌÏÎ]", "utf-8")) re_unicode_uppercase_o = re.compile(unicode(r"(?u)[ÓÒÖÔÕØ]", "utf-8")) re_unicode_uppercase_u = re.compile(unicode(r"(?u)[ÚÙÜÛ]", "utf-8")) re_unicode_uppercase_y = re.compile(unicode(r"(?u)[Ý]", "utf-8")) re_unicode_uppercase_c = re.compile(unicode(r"(?u)[ÇĆ]", "utf-8")) re_unicode_uppercase_n = re.compile(unicode(r"(?u)[Ñ]", "utf-8")) 
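# A minimal usage sketch for the accent-folding expressions above (the LaTeX
# patterns below play the analogous role for LaTeX-escaped input); the sample
# inputs are hypothetical:
#   re_unicode_lowercase_e.sub(u"e", u"résumé")    ->  u"resume"
#   re_unicode_lowercase_ss.sub(u"ss", u"straße")  ->  u"strasse"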
re_latex_lowercase_a = re.compile("\\\\[\"H'`~^vu=k]\{?a\}?") re_latex_lowercase_ae = re.compile("\\\\ae\\{\\}?") re_latex_lowercase_oe = re.compile("\\\\oe\\{\\}?") re_latex_lowercase_e = re.compile("\\\\[\"H'`~^vu=k]\\{?e\\}?") re_latex_lowercase_i = re.compile("\\\\[\"H'`~^vu=k]\\{?i\\}?") re_latex_lowercase_o = re.compile("\\\\[\"H'`~^vu=k]\\{?o\\}?") re_latex_lowercase_u = re.compile("\\\\[\"H'`~^vu=k]\\{?u\\}?") re_latex_lowercase_y = re.compile("\\\\[\"']\\{?y\\}?") re_latex_lowercase_c = re.compile("\\\\['uc]\\{?c\\}?") re_latex_lowercase_n = re.compile("\\\\[c'~^vu]\\{?n\\}?") re_latex_uppercase_a = re.compile("\\\\[\"H'`~^vu=k]\\{?A\\}?") re_latex_uppercase_ae = re.compile("\\\\AE\\{?\\}?") re_latex_uppercase_oe = re.compile("\\\\OE\\{?\\}?") re_latex_uppercase_e = re.compile("\\\\[\"H'`~^vu=k]\\{?E\\}?") re_latex_uppercase_i = re.compile("\\\\[\"H'`~^vu=k]\\{?I\\}?") re_latex_uppercase_o = re.compile("\\\\[\"H'`~^vu=k]\\{?O\\}?") re_latex_uppercase_u = re.compile("\\\\[\"H'`~^vu=k]\\{?U\\}?") re_latex_uppercase_y = re.compile("\\\\[\"']\\{?Y\\}?") re_latex_uppercase_c = re.compile("\\\\['uc]\\{?C\\}?") re_latex_uppercase_n = re.compile("\\\\[c'~^vu]\\{?N\\}?") def indent_text(text, nb_tabs=0, tab_str=" ", linebreak_input="\n", linebreak_output="\n", wrap=False): """ add tabs to each line of text @param text: the text to indent @param nb_tabs: number of tabs to add @param tab_str: type of tab (could be, for example "\t", default: 2 spaces @param linebreak_input: linebreak on input @param linebreak_output: linebreak on output @param wrap: wethever to apply smart text wrapping. (by means of wrap_text_in_a_box) @return: indented text as string """ if not wrap: lines = text.split(linebreak_input) tabs = nb_tabs*tab_str output = "" for line in lines: output += tabs + line + linebreak_output return output else: return wrap_text_in_a_box(body=text, style='no_border', tab_str=tab_str, tab_num=nb_tabs) _RE_BEGINNING_SPACES = re.compile(r'^\s*') _RE_NEWLINES_CLEANER = re.compile(r'\n+') _RE_LONELY_NEWLINES = re.compile(r'\b\n\b') def wrap_text_in_a_box(body='', title='', style='double_star', **args): """Return a nicely formatted text box: e.g. ****************** ** title ** **--------------** ** body ** ****************** Indentation and newline are respected. @param body: the main text @param title: an optional title @param style: the name of one of the style in CFG_WRAP_STYLES. By default the double_star style is used. 
You can further tune the desired style by setting various optional parameters: @param horiz_sep: a string that is repeated in order to produce a separator row between the title and the body (if needed) or a tuple of three characters in the form (l, c, r) @param max_col: the maximum number of coulmns used by the box (including indentation) @param min_col: the symmetrical minimum number of columns @param tab_str: a string to represent indentation @param tab_num: the number of leveles of indentations @param border: a tuple of 8 element in the form (tl, t, tr, l, r, bl, b, br) of strings that represent the different corners and sides of the box @param prefix: a prefix string added before the box @param suffix: a suffix string added after the box @param break_long: wethever to break long words in order to respect max_col @param force_horiz: True in order to print the horizontal line even when there is no title e.g.: print wrap_text_in_a_box(title='prova', body=' 123 prova.\n Vediamo come si indenta', horiz_sep='-', style='no_border', max_col=20, tab_num=1) prova ---------------- 123 prova. Vediamo come si indenta """ def _wrap_row(row, max_col, break_long): """Wrap a single row""" spaces = _RE_BEGINNING_SPACES.match(row).group() row = row[len(spaces):] spaces = spaces.expandtabs() return textwrap.wrap(row, initial_indent=spaces, subsequent_indent=spaces, width=max_col, break_long_words=break_long) def _clean_newlines(text): text = _RE_LONELY_NEWLINES.sub(' \n', text) return _RE_NEWLINES_CLEANER.sub(lambda x: x.group()[:-1], text) body = unicode(body, 'utf-8') title = unicode(title, 'utf-8') astyle = dict(CFG_WRAP_TEXT_IN_A_BOX_STYLES['__DEFAULT']) if CFG_WRAP_TEXT_IN_A_BOX_STYLES.has_key(style): astyle.update(CFG_WRAP_TEXT_IN_A_BOX_STYLES[style]) astyle.update(args) horiz_sep = astyle['horiz_sep'] border = astyle['border'] tab_str = astyle['tab_str'] * astyle['tab_num'] max_col = max(astyle['max_col'] \ - len(border[3]) - len(border[4]) - len(tab_str), 1) min_col = astyle['min_col'] prefix = astyle['prefix'] suffix = astyle['suffix'] force_horiz = astyle['force_horiz'] break_long = astyle['break_long'] body = _clean_newlines(body) tmp_rows = [_wrap_row(row, max_col, break_long) for row in body.split('\n')] body_rows = [] for rows in tmp_rows: if rows: body_rows += rows else: body_rows.append('') if not ''.join(body_rows).strip(): # Concrete empty body body_rows = [] title = _clean_newlines(title) tmp_rows = [_wrap_row(row, max_col, break_long) for row in title.split('\n')] title_rows = [] for rows in tmp_rows: if rows: title_rows += rows else: title_rows.append('') if not ''.join(title_rows).strip(): # Concrete empty title title_rows = [] max_col = max([len(row) for row in body_rows + title_rows] + [min_col]) mid_top_border_len = max_col \ + len(border[3]) + len(border[4]) - len(border[0]) - len(border[2]) mid_bottom_border_len = max_col \ + len(border[3]) + len(border[4]) - len(border[5]) - len(border[7]) top_border = border[0] \ + (border[1] * mid_top_border_len)[:mid_top_border_len] + border[2] bottom_border = border[5] \ + (border[6] * mid_bottom_border_len)[:mid_bottom_border_len] \ + border[7] if type(horiz_sep) is tuple and len(horiz_sep) == 3: horiz_line = horiz_sep[0] + (horiz_sep[1] * (max_col + 2))[:(max_col + 2)] + horiz_sep[2] else: horiz_line = border[3] + (horiz_sep * max_col)[:max_col] + border[4] title_rows = [tab_str + border[3] + row + ' ' * (max_col - len(row)) + border[4] for row in title_rows] body_rows = [tab_str + border[3] + row + ' ' * (max_col - len(row)) + border[4] 
    title_rows = [tab_str + border[3] + row + ' ' * (max_col - len(row)) +
                  border[4] for row in title_rows]
    body_rows = [tab_str + border[3] + row + ' ' * (max_col - len(row)) +
                 border[4] for row in body_rows]

    ret = []
    if top_border:
        ret += [tab_str + top_border]
    ret += title_rows
    if title_rows or force_horiz:
        ret += [tab_str + horiz_line]
    ret += body_rows
    if bottom_border:
        ret += [tab_str + bottom_border]
    return (prefix + '\n'.join(ret) + suffix).encode('utf-8')

def wait_for_user(msg=""):
    """
    Print MSG and a confirmation prompt, waiting for the user's
    confirmation, unless the silent '--yes-i-know' command line option was
    used, in which case the function returns immediately without printing
    anything.
    """
    if '--yes-i-know' in sys.argv:
        return
    print msg
    try:
        answer = raw_input("Please confirm by typing 'Yes, I know!': ")
    except KeyboardInterrupt:
        print
        answer = ''
    if answer != 'Yes, I know!':
        sys.stderr.write("ERROR: Aborted.\n")
        sys.exit(1)
    return

def guess_minimum_encoding(text, charsets=('ascii', 'latin1', 'utf8')):
    """Try to guess the minimum charset that is able to represent the given
    text using the provided charsets. text is supposed to be encoded in
    utf8. Returns (encoded_text, charset) where charset is the first charset
    in the sequence being able to encode text. Returns (text_in_utf8,
    'utf8') in case no charset is able to encode text.

    @note: If the input text is not in strict UTF-8, then replace any
        non-UTF-8 chars inside it.
    """
    text_in_unicode = text.decode('utf8', 'replace')
    for charset in charsets:
        try:
            return (text_in_unicode.encode(charset), charset)
        except (UnicodeEncodeError, UnicodeDecodeError):
            pass
    return (text_in_unicode.encode('utf8'), 'utf8')

def encode_for_xml(text, wash=False, xml_version='1.0', quote=False):
    """Encode special characters in a text so that it is XML-compliant.

    @param text: text to encode
    @return: an encoded text"""
    text = text.replace('&', '&amp;')
    text = text.replace('<', '&lt;')
    if quote:
        text = text.replace('"', '&quot;')
    if wash:
        text = wash_for_xml(text, xml_version=xml_version)
    return text

try:
    unichr(0x100000)
    RE_ALLOWED_XML_1_0_CHARS = re.compile(u'[^\U00000009\U0000000A\U0000000D\U00000020-\U0000D7FF\U0000E000-\U0000FFFD\U00010000-\U0010FFFF]')
    RE_ALLOWED_XML_1_1_CHARS = re.compile(u'[^\U00000001-\U0000D7FF\U0000E000-\U0000FFFD\U00010000-\U0010FFFF]')
except ValueError:
    # oops, we are running on a narrow UTF/UCS Python build,
    # so we have to limit the UTF/UCS char range:
    RE_ALLOWED_XML_1_0_CHARS = re.compile(u'[^\U00000009\U0000000A\U0000000D\U00000020-\U0000D7FF\U0000E000-\U0000FFFD]')
    RE_ALLOWED_XML_1_1_CHARS = re.compile(u'[^\U00000001-\U0000D7FF\U0000E000-\U0000FFFD]')

def wash_for_xml(text, xml_version='1.0'):
    """
    Removes any character which is not in the range of allowed characters
    for XML. The allowed characters depend on the version of XML.

        - XML 1.0: <http://www.w3.org/TR/REC-xml/#charsets>
        - XML 1.1: <http://www.w3.org/TR/xml11/#charsets>

    @param text: input string to wash.
    @param xml_version: version of the XML for which we wash the input.
        Value for this parameter can be '1.0' or '1.1'
    """
    if xml_version == '1.0':
        return RE_ALLOWED_XML_1_0_CHARS.sub('',
                                            unicode(text, 'utf-8')).encode('utf-8')
    else:
        return RE_ALLOWED_XML_1_1_CHARS.sub('',
                                            unicode(text, 'utf-8')).encode('utf-8')

def wash_for_utf8(text, correct=True):
    """Return UTF-8 encoded binary string with incorrect characters washed
    away.
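
    For example (an illustrative sketch; assumes the default correct=True
    behavior of silently dropping bytes that are not valid UTF-8):

        >>> wash_for_utf8('p\xc3\xa9tanque')   # valid UTF-8 passes through
        'p\xc3\xa9tanque'
        >>> wash_for_utf8('\xfftanque')        # the stray invalid byte is dropped
        'tanque'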

    @param text: input string to wash (can be either a binary string or a
        Unicode string)
    @param correct: whether to correct bad characters or raise an exception
    """
    if isinstance(text, unicode):
        return text.encode('utf-8')

    errors = "ignore" if correct else "strict"
    return text.decode("utf-8", errors).encode("utf-8", errors)

def nice_size(size):
    """
    @param size: the size in bytes.
    @type size: int
    @return: a nicely printed size.
    @rtype: string
    """
    websearch_templates = invenio.template.load('websearch')
    unit = 'B'
    if size > 1024:
        size /= 1024.0
        unit = 'KB'
    if size > 1024:
        size /= 1024.0
        unit = 'MB'
    if size > 1024:
        size /= 1024.0
        unit = 'GB'
    return '%s %s' % (websearch_templates.tmpl_nice_number(size, max_ndigits_after_dot=2), unit)

def remove_line_breaks(text):
    """
    Remove line breaks from input, including the Unicode 'line separator'
    (U+2028), 'paragraph separator' (U+2029), and 'next line' (U+0085)
    characters.
    """
    return unicode(text, 'utf-8').replace('\f', '').replace('\n', '').replace('\r', '').replace(u'\u2028', '').replace(u'\u2029', '').replace(u'\x85', '').encode('utf-8')

def decode_to_unicode(text, default_encoding='utf-8'):
    """
    Decode input text into Unicode representation by first using the default
    encoding utf-8. If the operation fails, it detects the type of encoding
    used in the given text. For optimal results, it is recommended that the
    'chardet' module be installed.

    NOTE: Beware that this might be slow for *very* large strings.

    If chardet detection fails, it will try to decode the string using the
    basic detection function guess_minimum_encoding().

    Also, bear in mind that it is impossible to detect the correct encoding
    at all times, other than by making educated guesses. With that said,
    this function will always return some decoded Unicode string, however
    the data returned may not be the same as the original data in some
    cases.

    @param text: the text to decode
    @type text: string

    @param default_encoding: the character encoding to use. Optional.
    @type default_encoding: string

    @return: input text as Unicode
    @rtype: string
    """
    if not text:
        return ""
    try:
        return text.decode(default_encoding)
    except (UnicodeError, LookupError):
        pass
    detected_encoding = None
    if CHARDET_AVAILABLE:
        # We can use chardet to perform detection
        res = chardet.detect(text)
        if res['confidence'] >= 0.8:
            detected_encoding = res['encoding']
    if detected_encoding is None:
        # No chardet detection, try to make a basic guess
        dummy, detected_encoding = guess_minimum_encoding(text)
    return text.decode(detected_encoding)

def translate_latex2unicode(text, kb_file="%s/bibconvert/KB/latex-to-unicode.kb" % \
                            (CFG_ETCDIR,)):
    """
    This function takes the given text, presumably containing LaTeX symbols,
    and attempts to translate it to Unicode using the given or default KB
    translation table located under CFG_ETCDIR/bibconvert/KB/latex-to-unicode.kb.
    The translated Unicode string will then be returned.

    If the translation table and compiled regular expression object have not
    yet been generated in the current session, they will be.
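
    For example (illustrative; the exact mappings depend on the contents of
    the latex-to-unicode KB file shipped with the installation):

        >>> translate_latex2unicode("\\'a \\'i \\'U")
        u'\xe1 \xed \xda'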

    @param text: a text presumably containing LaTeX symbols.
    @type text: string

    @param kb_file: full path to file containing latex2unicode translations.
        Defaults to CFG_ETCDIR/bibconvert/KB/latex-to-unicode.kb
    @type kb_file: string

    @return: Unicode representation of translated text
    @rtype: unicode
    """
    # First decode input text to Unicode
    try:
        text = decode_to_unicode(text)
    except UnicodeDecodeError:
        text = unicode(wash_for_utf8(text))
    # Load translation table, if required
    if CFG_LATEX_UNICODE_TRANSLATION_CONST == {}:
        _load_latex2unicode_constants(kb_file)
    # Find all matches and replace text
    for match in CFG_LATEX_UNICODE_TRANSLATION_CONST['regexp_obj'].finditer(text):
        # If LaTeX style markers {, } and $ are before or after the
        # matching text, replace those as well
        text = re.sub("[\{\$]?%s[\}\$]?" % (re.escape(match.group()),), \
                      CFG_LATEX_UNICODE_TRANSLATION_CONST['table'][match.group()], \
                      text)
    # Return Unicode representation of translated text
    return text

def _load_latex2unicode_constants(kb_file="%s/bibconvert/KB/latex-to-unicode.kb" % \
                                  (CFG_ETCDIR,)):
    """
    Load LaTeX2Unicode translation table dictionary and regular expression
    object from KB to a global dictionary.

    @param kb_file: full path to file containing latex2unicode translations.
        Defaults to CFG_ETCDIR/bibconvert/KB/latex-to-unicode.kb
    @type kb_file: string

    @return: dict of type: {'regexp_obj': regexp match object,
                            'table': dict of LaTeX -> Unicode mappings}
    @rtype: dict
    """
    try:
        data = open(kb_file)
    except IOError:
        # File not found or similar
        sys.stderr.write("\nCould not open LaTeX to Unicode KB file. Aborting translation.\n")
        return CFG_LATEX_UNICODE_TRANSLATION_CONST
    latex_symbols = []
    translation_table = {}
    for line in data:
        # Each line of the file has the form latex|--|utf-8. First decode to Unicode.
        line = line.decode('utf-8')
        mapping = line.split('|--|')
        translation_table[mapping[0].rstrip('\n')] = mapping[1].rstrip('\n')
        latex_symbols.append(re.escape(mapping[0].rstrip('\n')))
    data.close()
    CFG_LATEX_UNICODE_TRANSLATION_CONST['regexp_obj'] = re.compile("|".join(latex_symbols))
    CFG_LATEX_UNICODE_TRANSLATION_CONST['table'] = translation_table

def translate_to_ascii(values):
    """
    Transliterate the string contents of the given sequence into ascii
    representation, using the 'unidecode' module. Returns a sequence with
    the modified values.

    For example: H\xc3\xb6hne becomes Hohne.

    Note: Passed strings are returned as a list.
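
    For instance (illustrative; the mapping of some characters, e.g. 'ö',
    differs between unidecode versions):

        >>> translate_to_ascii("résumé")
        ['resume']
        >>> translate_to_ascii(["Åge", "normal"])
        ['Age', 'normal']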

    @param values: sequence of strings to transform
    @type values: sequence

    @return: sequence with values transformed to ascii
    @rtype: sequence
    """
    if not values and not type(values) == str:
        return values

    if type(values) == str:
        values = [values]

    for index, value in enumerate(values):
        if not value:
            continue
-        if not UNIDECODE_AVAILABLE:
-            ascii_text = strip_accents(value)
-        else:
-            encoded_text, encoding = guess_minimum_encoding(value)
-            unicode_text = unicode(encoded_text.decode(encoding))
-            decoded_text = ""
+        unicode_text = decode_to_unicode(value)
+        if u"[?]" in unicode_text:
+            decoded_text = []
            for unicode_char in unicode_text:
                decoded_char = unidecode(unicode_char)
                # Skip unrecognized characters
                if decoded_char != "[?]":
-                    decoded_text += decoded_char
-            ascii_text = decoded_text.encode('ascii')
+                    decoded_text.append(decoded_char)
+            ascii_text = ''.join(decoded_text).encode('ascii')
+        else:
+            ascii_text = unidecode(unicode_text).replace(u"[?]", u"").encode('ascii')
        values[index] = ascii_text

    return values

def xml_entities_to_utf8(text, skip=('lt', 'gt', 'amp')):
    """
    Removes HTML or XML character references and entities from a text
    string and replaces them with their UTF-8 representation, if possible.

    @param text: The HTML (or XML) source text.
    @type text: string

    @param skip: list of entity names to skip when transforming.
    @type skip: iterable

    @return: the text with all transformable entities replaced by their
        UTF-8 representation.

    @author: Based on http://effbot.org/zone/re-sub.htm#unescape-html
    """
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16)).encode("utf-8")
                else:
                    return unichr(int(text[2:-1])).encode("utf-8")
            except ValueError:
                pass
        else:
            # named entity
            if text[1:-1] not in skip:
                try:
                    text = unichr(htmlentitydefs.name2codepoint[text[1:-1]]).encode("utf-8")
                except KeyError:
                    pass
        return text  # leave as is
    return re.sub("&#?\w+;", fixup, text)

def strip_accents(x):
    """
    Strip accents in the input phrase X (assumed in UTF-8) by replacing
    accented characters with their unaccented cousins (e.g. é by e).

    @param x: the input phrase to strip.
    @type x: string

    @return: Return such a stripped X.
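
    For example:

        >>> strip_accents('mémêmëmè')
        'memememe'
        >>> strip_accents('Œuvre')
        'OEuvre'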
""" x = re_latex_lowercase_a.sub("a", x) x = re_latex_lowercase_ae.sub("ae", x) x = re_latex_lowercase_oe.sub("oe", x) x = re_latex_lowercase_e.sub("e", x) x = re_latex_lowercase_i.sub("i", x) x = re_latex_lowercase_o.sub("o", x) x = re_latex_lowercase_u.sub("u", x) x = re_latex_lowercase_y.sub("x", x) x = re_latex_lowercase_c.sub("c", x) x = re_latex_lowercase_n.sub("n", x) x = re_latex_uppercase_a.sub("A", x) x = re_latex_uppercase_ae.sub("AE", x) x = re_latex_uppercase_oe.sub("OE", x) x = re_latex_uppercase_e.sub("E", x) x = re_latex_uppercase_i.sub("I", x) x = re_latex_uppercase_o.sub("O", x) x = re_latex_uppercase_u.sub("U", x) x = re_latex_uppercase_y.sub("Y", x) x = re_latex_uppercase_c.sub("C", x) x = re_latex_uppercase_n.sub("N", x) # convert input into Unicode string: try: y = unicode(x, "utf-8") except: return x # something went wrong, probably the input wasn't UTF-8 # asciify Latin-1 lowercase characters: y = re_unicode_lowercase_a.sub("a", y) y = re_unicode_lowercase_ae.sub("ae", y) y = re_unicode_lowercase_oe.sub("oe", y) y = re_unicode_lowercase_e.sub("e", y) y = re_unicode_lowercase_i.sub("i", y) y = re_unicode_lowercase_o.sub("o", y) y = re_unicode_lowercase_u.sub("u", y) y = re_unicode_lowercase_y.sub("y", y) y = re_unicode_lowercase_c.sub("c", y) y = re_unicode_lowercase_n.sub("n", y) y = re_unicode_lowercase_ss.sub("ss", y) # asciify Latin-1 uppercase characters: y = re_unicode_uppercase_a.sub("A", y) y = re_unicode_uppercase_ae.sub("AE", y) y = re_unicode_uppercase_oe.sub("OE", y) y = re_unicode_uppercase_e.sub("E", y) y = re_unicode_uppercase_i.sub("I", y) y = re_unicode_uppercase_o.sub("O", y) y = re_unicode_uppercase_u.sub("U", y) y = re_unicode_uppercase_y.sub("Y", y) y = re_unicode_uppercase_c.sub("C", y) y = re_unicode_uppercase_n.sub("N", y) # return UTF-8 representation of the Unicode string: return y.encode("utf-8") def show_diff(original, modified, prefix='', suffix='', prefix_unchanged=' ', suffix_unchanged='', prefix_removed='-', suffix_removed='', prefix_added='+', suffix_added=''): """ Returns the diff view between original and modified strings. Function checks both arguments line by line and returns a string with a: - prefix_unchanged when line is common to both sequences - prefix_removed when line is unique to sequence 1 - prefix_added when line is unique to sequence 2 and a corresponding suffix in each line @param original: base string @param modified: changed string @param prefix: prefix of the output string @param suffix: suffix of the output string @param prefix_unchanged: prefix of the unchanged line @param suffix_unchanged: suffix of the unchanged line @param prefix_removed: prefix of the removed line @param suffix_removed: suffix of the removed line @param prefix_added: prefix of the added line @param suffix_added: suffix of the added line @return: string with the comparison of the records @rtype: string """ import difflib differ = difflib.Differ() result = [prefix] for line in differ.compare(modified.splitlines(), original.splitlines()): if line[0] == ' ': # Mark as unchanged result.append(prefix_unchanged + line[2:].strip() + suffix_unchanged) elif line[0] == '-': # Mark as removed result.append(prefix_removed + line[2:].strip() + suffix_removed) elif line[0] == '+': # Mark as added/modified result.append(prefix_added + line[2:].strip() + suffix_added) result.append(suffix) return '\n'.join(result) def transliterate_ala_lc(value): """ Transliterate a string. 
    Compatibility with the ALA-LC romanization standard:
    http://www.loc.gov/catdir/cpso/roman.html

    Maps from one system of writing into another, letter by letter.

    Uses the 'unidecode' module.

    @param value: string to transform
    @type value: string

    @return: the transliterated string
    @rtype: string
    """
    if not value:
        return value

-    if UNIDECODE_AVAILABLE:
-        text = unidecode(value)
-    else:
-        text = translate_to_ascii(value)
-        text = text.pop()
+    text = unidecode(value)
    return text

def escape_latex(text):
    """
    This function takes the given text and escapes characters that have a
    special meaning in LaTeX: # $ % ^ & _ { } ~ \
    """
    text = unicode(text.decode('utf-8'))
    CHARS = {
        '&': r'\&',
        '%': r'\%',
        '$': r'\$',
        '#': r'\#',
        '_': r'\_',
        '{': r'\{',
        '}': r'\}',
        '~': r'\~{}',
        '^': r'\^{}',
        '\\': r'\textbackslash{}',
    }
    escaped = "".join([CHARS.get(char, char) for char in text])
    return escaped.encode('utf-8')

diff --git a/modules/miscutil/lib/textutils_unit_tests.py b/modules/miscutil/lib/textutils_unit_tests.py
index 66bdf981b..9a7a93145 100644
--- a/modules/miscutil/lib/textutils_unit_tests.py
+++ b/modules/miscutil/lib/textutils_unit_tests.py
@@ -1,589 +1,581 @@
# -*- coding: utf-8 -*-
##
## This file is part of Invenio.
## Copyright (C) 2008, 2009, 2010, 2011, 2013 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
"""Unit tests for the textutils library.""" __revision__ = "$Id$" from invenio.testutils import InvenioTestCase try: import chardet CHARDET_AVAILABLE = True except ImportError: CHARDET_AVAILABLE = False -try: - from unidecode import unidecode - UNIDECODE_AVAILABLE = True -except ImportError: - UNIDECODE_AVAILABLE = False +from unidecode import unidecode from invenio.textutils import \ wrap_text_in_a_box, \ guess_minimum_encoding, \ wash_for_xml, \ wash_for_utf8, \ decode_to_unicode, \ translate_latex2unicode, \ translate_to_ascii, \ strip_accents, \ transliterate_ala_lc, \ escape_latex, \ show_diff from invenio.testutils import make_test_suite, run_test_suite class GuessMinimumEncodingTest(InvenioTestCase): """Test functions related to guess_minimum_encoding function.""" def test_guess_minimum_encoding(self): """textutils - guess_minimum_encoding.""" self.assertEqual(guess_minimum_encoding('patata'), ('patata', 'ascii')) self.assertEqual(guess_minimum_encoding('àèéìòù'), ('\xe0\xe8\xe9\xec\xf2\xf9', 'latin1')) self.assertEqual(guess_minimum_encoding('Ιθάκη'), ('Ιθάκη', 'utf8')) class WashForXMLTest(InvenioTestCase): """Test functions related to wash_for_xml function.""" def test_latin_characters_washing_1_0(self): """textutils - washing latin characters for XML 1.0.""" self.assertEqual(wash_for_xml('àèéìòùÀ'), 'àèéìòùÀ') def test_latin_characters_washing_1_1(self): """textutils - washing latin characters for XML 1.1.""" self.assertEqual(wash_for_xml('àèéìòùÀ', xml_version='1.1'), 'àèéìòùÀ') def test_chinese_characters_washing_1_0(self): """textutils - washing chinese characters for XML 1.0.""" self.assertEqual(wash_for_xml(''' 春眠暁を覚えず 処処に啼鳥と聞く 夜来風雨の声 花落つること 知んぬ多少ぞ'''), ''' 春眠暁を覚えず 処処に啼鳥と聞く 夜来風雨の声 花落つること 知んぬ多少ぞ''') def test_chinese_characters_washing_1_1(self): """textutils - washing chinese characters for XML 1.1.""" self.assertEqual(wash_for_xml(''' 春眠暁を覚えず 処処に啼鳥と聞く 夜来風雨の声 花落つること 知んぬ多少ぞ''', xml_version='1.1'), ''' 春眠暁を覚えず 処処に啼鳥と聞く 夜来風雨の声 花落つること 知んぬ多少ぞ''') def test_greek_characters_washing_1_0(self): """textutils - washing greek characters for XML 1.0.""" self.assertEqual(wash_for_xml(''' ἄνδρα μοι ἔννεπε, μου̂σα, πολύτροπον, ὃς μάλα πολλὰ πλάγχθη, ἐπεὶ Τροίης ἱερὸν πτολίεθρον ἔπερσεν: πολλω̂ν δ' ἀνθρώπων ἴδεν ἄστεα καὶ νόον ἔγνω, πολλὰ δ' ὅ γ' ἐν πόντῳ πάθεν ἄλγεα ὃν κατὰ θυμόν, ἀρνύμενος ἥν τε ψυχὴν καὶ νόστον ἑταίρων. ἀλλ' οὐδ' ὣς ἑτάρους ἐρρύσατο, ἱέμενός περ: αὐτω̂ν γὰρ σφετέρῃσιν ἀτασθαλίῃσιν ὄλοντο, νήπιοι, οἳ κατὰ βου̂ς ̔Υπερίονος ̓Ηελίοιο ἤσθιον: αὐτὰρ ὁ τοι̂σιν ἀφείλετο νόστιμον ἠ̂μαρ. τω̂ν ἁμόθεν γε, θεά, θύγατερ Διός, εἰπὲ καὶ ἡμι̂ν.'''), ''' ἄνδρα μοι ἔννεπε, μου̂σα, πολύτροπον, ὃς μάλα πολλὰ πλάγχθη, ἐπεὶ Τροίης ἱερὸν πτολίεθρον ἔπερσεν: πολλω̂ν δ' ἀνθρώπων ἴδεν ἄστεα καὶ νόον ἔγνω, πολλὰ δ' ὅ γ' ἐν πόντῳ πάθεν ἄλγεα ὃν κατὰ θυμόν, ἀρνύμενος ἥν τε ψυχὴν καὶ νόστον ἑταίρων. ἀλλ' οὐδ' ὣς ἑτάρους ἐρρύσατο, ἱέμενός περ: αὐτω̂ν γὰρ σφετέρῃσιν ἀτασθαλίῃσιν ὄλοντο, νήπιοι, οἳ κατὰ βου̂ς ̔Υπερίονος ̓Ηελίοιο ἤσθιον: αὐτὰρ ὁ τοι̂σιν ἀφείλετο νόστιμον ἠ̂μαρ. τω̂ν ἁμόθεν γε, θεά, θύγατερ Διός, εἰπὲ καὶ ἡμι̂ν.''') def test_greek_characters_washing_1_1(self): """textutils - washing greek characters for XML 1.1.""" self.assertEqual(wash_for_xml(''' ἄνδρα μοι ἔννεπε, μου̂σα, πολύτροπον, ὃς μάλα πολλὰ πλάγχθη, ἐπεὶ Τροίης ἱερὸν πτολίεθρον ἔπερσεν: πολλω̂ν δ' ἀνθρώπων ἴδεν ἄστεα καὶ νόον ἔγνω, πολλὰ δ' ὅ γ' ἐν πόντῳ πάθεν ἄλγεα ὃν κατὰ θυμόν, ἀρνύμενος ἥν τε ψυχὴν καὶ νόστον ἑταίρων. 
ἀλλ' οὐδ' ὣς ἑτάρους ἐρρύσατο, ἱέμενός περ:
αὐτω̂ν γὰρ σφετέρῃσιν ἀτασθαλίῃσιν ὄλοντο,
νήπιοι, οἳ κατὰ βου̂ς ̔Υπερίονος ̓Ηελίοιο
ἤσθιον: αὐτὰρ ὁ τοι̂σιν ἀφείλετο νόστιμον ἠ̂μαρ.
τω̂ν ἁμόθεν γε, θεά, θύγατερ Διός, εἰπὲ καὶ ἡμι̂ν.''', xml_version='1.1'), '''
ἄνδρα μοι ἔννεπε, μου̂σα, πολύτροπον, ὃς μάλα πολλὰ
πλάγχθη, ἐπεὶ Τροίης ἱερὸν πτολίεθρον ἔπερσεν:
πολλω̂ν δ' ἀνθρώπων ἴδεν ἄστεα καὶ νόον ἔγνω,
πολλὰ δ' ὅ γ' ἐν πόντῳ πάθεν ἄλγεα ὃν κατὰ θυμόν,
ἀρνύμενος ἥν τε ψυχὴν καὶ νόστον ἑταίρων.
ἀλλ' οὐδ' ὣς ἑτάρους ἐρρύσατο, ἱέμενός περ:
αὐτω̂ν γὰρ σφετέρῃσιν ἀτασθαλίῃσιν ὄλοντο,
νήπιοι, οἳ κατὰ βου̂ς ̔Υπερίονος ̓Ηελίοιο
ἤσθιον: αὐτὰρ ὁ τοι̂σιν ἀφείλετο νόστιμον ἠ̂μαρ.
τω̂ν ἁμόθεν γε, θεά, θύγατερ Διός, εἰπὲ καὶ ἡμι̂ν.''')

    def test_russian_characters_washing_1_0(self):
        """textutils - washing russian characters for XML 1.0."""
        self.assertEqual(wash_for_xml('''
В тени дерев, над чистыми водами
Дерновый холм вы видите ль, друзья?
Чуть слышно там плескает в брег струя;
Чуть ветерок там дышит меж листами;
На ветвях лира и венец...
Увы! друзья, сей холм - могила;
Здесь прах певца земля сокрыла;
Бедный певец!'''), '''
В тени дерев, над чистыми водами
Дерновый холм вы видите ль, друзья?
Чуть слышно там плескает в брег струя;
Чуть ветерок там дышит меж листами;
На ветвях лира и венец...
Увы! друзья, сей холм - могила;
Здесь прах певца земля сокрыла;
Бедный певец!''')

    def test_russian_characters_washing_1_1(self):
        """textutils - washing russian characters for XML 1.1."""
        self.assertEqual(wash_for_xml('''
В тени дерев, над чистыми водами
Дерновый холм вы видите ль, друзья?
Чуть слышно там плескает в брег струя;
Чуть ветерок там дышит меж листами;
На ветвях лира и венец...
Увы! друзья, сей холм - могила;
Здесь прах певца земля сокрыла;
Бедный певец!''', xml_version='1.1'), '''
В тени дерев, над чистыми водами
Дерновый холм вы видите ль, друзья?
Чуть слышно там плескает в брег струя;
Чуть ветерок там дышит меж листами;
На ветвях лира и венец...
Увы! друзья, сей холм - могила;
Здесь прах певца земля сокрыла;
Бедный певец!''')

    def test_illegal_characters_washing_1_0(self):
        """textutils - washing illegal characters for XML 1.0."""
        self.assertEqual(wash_for_xml(chr(8) + chr(9) + 'some chars'),
                         '\tsome chars')
        self.assertEqual(wash_for_xml('$b\bar{b}$'), '$bar{b}$')

    def test_illegal_characters_washing_1_1(self):
        """textutils - washing illegal characters for XML 1.1."""
        self.assertEqual(wash_for_xml(chr(8) + chr(9) + 'some chars',
                                      xml_version='1.1'),
                         '\x08\tsome chars')
        self.assertEqual(wash_for_xml('$b\bar{b}$', xml_version='1.1'),
                         '$b\x08ar{b}$')


class WashForUTF8Test(InvenioTestCase):
    """Test functions related to wash_for_utf8 function."""

    def test_normal_legal_string_washing(self):
        """textutils - testing UTF-8 washing on a perfectly normal string"""
        some_str = "This is an example string"
        self.assertEqual(some_str, wash_for_utf8(some_str))

    def test_chinese_string_washing(self):
        """textutils - testing washing functions on chinese script"""
        some_str = """春眠暁を覚えず
処処に啼鳥と聞く
夜来風雨の声
花落つること
知んぬ多少ぞ"""
        self.assertEqual(some_str, wash_for_utf8(some_str))

    def test_russian_characters_washing(self):
        """textutils - washing Russian characters for UTF-8"""
        self.assertEqual(wash_for_utf8('''
В тени дерев, над чистыми водами
Дерновый холм вы видите ль, друзья?
Чуть слышно там плескает в брег струя; Чуть ветерок там дышит меж листами; На ветвях лира и венец... Увы! друзья, сей холм - могила; Здесь прах певца земля сокрыла; Бедный певец!''') def test_remove_incorrect_unicode_characters(self): """textutils - washing out the incorrect characters""" self.assertEqual(wash_for_utf8("Ź\206dź\204bło żół\203wia \202"), "Źdźbło żółwia ") def test_empty_string_wash(self): """textutils - washing an empty string""" self.assertEqual(wash_for_utf8(""), "") def test_only_incorrect_unicode_wash(self): """textutils - washing an empty string""" self.assertEqual(wash_for_utf8("\202\203\204\205"), "") def test_raising_exception_on_incorrect(self): """textutils - assuring an exception on incorrect input""" self.assertRaises(UnicodeDecodeError, wash_for_utf8, "\202\203\204\205", correct=False) def test_already_utf8_input(self): """textutils - washing a Unicode string into UTF-8 binary string""" self.assertEqual('Göppert', wash_for_utf8(u'G\xf6ppert', True)) class WrapTextInABoxTest(InvenioTestCase): """Test functions related to wrap_text_in_a_box function.""" def test_plain_wrap_text_in_a_box(self): """textutils - wrap_text_in_a_box plain.""" result = """ ********************************************** ** foo bar ** ********************************************** """ self.assertEqual(wrap_text_in_a_box('foo bar'), result) def test_empty_wrap_text_in_a_box(self): """textutils - wrap_text_in_a_box empty.""" result = """ ********************************************** ********************************************** """ self.assertEqual(wrap_text_in_a_box(), result) def test_with_title_wrap_text_in_a_box(self): """textutils - wrap_text_in_a_box with title.""" result = """ ********************************************** ** a Title! ** ** **************************************** ** ** foo bar ** ********************************************** """ self.assertEqual(wrap_text_in_a_box('foo bar', title='a Title!'), result) def test_multiline_wrap_text_in_a_box(self): """textutils - wrap_text_in_a_box multiline.""" result = """ ********************************************** ** foo bar ** ********************************************** """ self.assertEqual(wrap_text_in_a_box('foo\n bar'), result) def test_real_multiline_wrap_text_in_a_box(self): """textutils - wrap_text_in_a_box real multiline.""" result = """ ********************************************** ** foo ** ** bar ** ********************************************** """ self.assertEqual(wrap_text_in_a_box('foo\n\nbar'), result) def test_real_no_width_wrap_text_in_a_box(self): """textutils - wrap_text_in_a_box no width.""" result = """ ************ ** foobar ** ************ """ self.assertEqual(wrap_text_in_a_box('foobar', min_col=0), result) def test_real_nothing_at_all_wrap_text_in_a_box(self): """textutils - wrap_text_in_a_box nothing at all.""" result = """ ****** ****** """ self.assertEqual(wrap_text_in_a_box(min_col=0), result) def test_real_squared_wrap_text_in_a_box(self): """textutils - wrap_text_in_a_box squared style.""" result = """ +--------+ | foobar | +--------+ """ self.assertEqual(wrap_text_in_a_box('foobar', style='squared', min_col=0), result) def test_indented_text_wrap_text_in_a_box(self): """textutils - wrap_text_in_a_box indented text.""" text = """ def test_real_squared_wrap_text_in_a_box(self):\n \"""wrap_text_in_a_box - squared style.\"""\n result = \"""\n +--------+\n | foobar |\n +--------+ \""" """ result = """ ****************************** ** def test_real_square ** ** d_wrap_text_in_a_box ** ** 
(self): ** ** \"""wrap_text_in_ ** ** a_box - squared ** ** style.\""" ** ** result = \""" ** ** +--------+ ** ** | foobar | ** ** +--------+\""" ** ****************************** """ self.assertEqual(wrap_text_in_a_box(text, min_col=0, max_col=30, break_long=True), result) def test_single_new_line_wrap_text_in_a_box(self): """textutils - wrap_text_in_a_box single new line.""" result = """ ********************************************** ** ciao come và? ** ********************************************** """ self.assertEqual(wrap_text_in_a_box("ciao\ncome và?"), result) def test_indented_box_wrap_text_in_a_box(self): """textutils - wrap_text_in_a_box indented box.""" result = """ ********************************************** ** foobar ** ********************************************** """ self.assertEqual(wrap_text_in_a_box('foobar', tab_num=1), result) def test_real_conclusion_wrap_text_in_a_box(self): """textutils - wrap_text_in_a_box conclusion.""" result = """---------------------------------------- foobar \n""" self.assertEqual(wrap_text_in_a_box('foobar', style='conclusion'), result) def test_real_longtext_wrap_text_in_a_box(self): """textutils - wrap_text_in_a_box long text.""" text = """Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. At vero eos et accusamus et iusto odio dignissimos ducimus qui blanditiis praesentium voluptatum deleniti atque corrupti quos dolores et quas molestias excepturi sint occaecati cupiditate non provident, similique sunt in culpa qui officia deserunt mollitia animi, id est laborum et dolorum fuga. Et harum quidem rerum facilis est et expedita distinctio. Nam libero tempore, cum soluta nobis est eligendi optio cumque nihil impedit quo minus id quod maxime placeat facere possimus, omnis voluptas assumenda est, omnis dolor repellendus. Temporibus autem quibusdam et aut officiis debitis aut rerum necessitatibus saepe eveniet ut et voluptates repudiandae sint et molestiae non recusandae. Itaque earum rerum hic tenetur a sapiente delectus, ut aut reiciendis voluptatibus maiores alias consequatur aut perferendis doloribus asperiores repellat.""" result = """ ************************************************************************ ** Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do ** ** eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut ** ** enim ad minim veniam, quis nostrud exercitation ullamco laboris ** ** nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in ** ** reprehenderit in voluptate velit esse cillum dolore eu fugiat ** ** nulla pariatur. Excepteur sint occaecat cupidatat non proident, ** ** sunt in culpa qui officia deserunt mollit anim id est laborum. ** ** At vero eos et accusamus et iusto odio dignissimos ducimus qui ** ** blanditiis praesentium voluptatum deleniti atque corrupti quos ** ** dolores et quas molestias excepturi sint occaecati cupiditate non ** ** provident, similique sunt in culpa qui officia deserunt mollitia ** ** animi, id est laborum et dolorum fuga. Et harum quidem rerum ** ** facilis est et expedita distinctio. 
Nam libero tempore, cum soluta ** ** nobis est eligendi optio cumque nihil impedit quo minus id quod ** ** maxime placeat facere possimus, omnis voluptas assumenda est, ** ** omnis dolor repellendus. Temporibus autem quibusdam et aut ** ** officiis debitis aut rerum necessitatibus saepe eveniet ut et ** ** voluptates repudiandae sint et molestiae non recusandae. Itaque ** ** earum rerum hic tenetur a sapiente delectus, ut aut reiciendis ** ** voluptatibus maiores alias consequatur aut perferendis doloribus ** ** asperiores repellat. ** ************************************************************************ """ self.assertEqual(wrap_text_in_a_box(text), result) class DecodeToUnicodeTest(InvenioTestCase): """Test functions related to decode_to_unicode function.""" if CHARDET_AVAILABLE: def test_decode_to_unicode(self): """textutils - decode_to_unicode.""" self.assertEqual(decode_to_unicode('\202\203\204\205', default_encoding='latin1'), u'\x82\x83\x84\x85') self.assertEqual(decode_to_unicode('àèéìòù'), u'\xe0\xe8\xe9\xec\xf2\xf9') self.assertEqual(decode_to_unicode('Ιθάκη'), u'\u0399\u03b8\u03ac\u03ba\u03b7') else: pass class Latex2UnicodeTest(InvenioTestCase): """Test functions related to translating LaTeX symbols to Unicode.""" def test_latex_to_unicode(self): """textutils - latex_to_unicode""" self.assertEqual(translate_latex2unicode("\\'a \\'i \\'U").encode('utf-8'), "á í Ú") self.assertEqual(translate_latex2unicode("\\'N \\k{i}"), u'\u0143 \u012f') self.assertEqual(translate_latex2unicode("\\AAkeson"), u'\u212bkeson') self.assertEqual(translate_latex2unicode("$\\mathsl{\\Zeta}$"), u'\U0001d6e7') class TestStripping(InvenioTestCase): """Test for stripping functions like accents and control characters.""" - if UNIDECODE_AVAILABLE: - def test_text_to_ascii(self): - """textutils - transliterate to ascii using unidecode""" - self.assert_(translate_to_ascii( - ["á í Ú", "H\xc3\xb6hne", "Åge Øst Vær", "normal"]) in - (["a i U", "Hohne", "Age Ost Vaer", "normal"], ## unidecode < 0.04.13 - ['a i U', 'Hoehne', 'Age Ost Vaer', 'normal']) ## unidecode >= 0.04.13 - ) - self.assertEqual(translate_to_ascii("àèéìòù"), ["aeeiou"]) - self.assertEqual(translate_to_ascii("ß"), ["ss"]) - self.assertEqual(translate_to_ascii(None), None) - self.assertEqual(translate_to_ascii([]), []) - self.assertEqual(translate_to_ascii([None]), [None]) - self.assertEqual(translate_to_ascii("√"), [""]) - else: - pass + def test_text_to_ascii(self): + """textutils - transliterate to ascii using unidecode""" + self.assert_(translate_to_ascii( + ["á í Ú", "H\xc3\xb6hne", "Åge Øst Vær", "normal"]) in + (["a i U", "Hohne", "Age Ost Vaer", "normal"], ## unidecode < 0.04.13 + ['a i U', 'Hoehne', 'Age Ost Vaer', 'normal']) ## unidecode >= 0.04.13 + ) + self.assertEqual(translate_to_ascii("àèéìòù"), ["aeeiou"]) + self.assertEqual(translate_to_ascii("ß"), ["ss"]) + self.assertEqual(translate_to_ascii(None), None) + self.assertEqual(translate_to_ascii([]), []) + self.assertEqual(translate_to_ascii([None]), [None]) + self.assertEqual(translate_to_ascii("√"), [""]) def test_strip_accents(self): """textutils - transliterate to ascii (basic)""" self.assertEqual("memememe", strip_accents('mémêmëmè')) self.assertEqual("MEMEMEME", strip_accents('MÉMÊMËMÈ')) self.assertEqual("oe", strip_accents('œ')) self.assertEqual("OE", strip_accents('Œ')) class TestDiffering(InvenioTestCase): """Test for differing two strings.""" string1 = """Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec fringilla tellus eget fringilla sagittis. 
Pellentesque posuere lacus id erat tristique pulvinar. Morbi volutpat, diam eget interdum lobortis, lacus mi cursus leo, sit amet porttitor neque est vitae lectus. Donec tempor metus vel tincidunt fringilla. Nam iaculis lacinia nisl, enim sollicitudin convallis. Morbi ut mauris velit. Proin suscipit dolor id risus placerat sodales nec id elit. Morbi vel lacinia lectus, eget laoreet dui. Nunc commodo neque porttitor eros placerat, sed ultricies purus accumsan. In velit nisi, accumsan molestie gravida a, rutrum in augue. Nulla pharetra purus nec dolor ornare, ut aliquam odio placerat. Aenean ultrices condimentum quam vitae pharetra.""" string2 = """Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec fringilla tellus eget fringilla sagittis. Pellentesque posuere lacus id erat. eget interdum lobortis, lacus mi cursus leo, sit amet porttitor neque est vitae lectus. Donec tempor metus vel tincidunt fringilla. Nam iaculis lacinia nisl, consectetur viverra enim sollicitudin convallis. Morbi ut mauris velit. Proin suscipit dolor id risus placerat sodales nec id elit. Morbi vel lacinia lectus, eget laoreet placerat sodales nec id elit. Morbi vel lacinia lectus, eget laoreet dui. Nunc commodo neque porttitor eros placerat, sed ultricies purus accumsan. In velit nisi, lorem ipsum lorem gravida a, rutrum in augue. Nulla pharetra purus nec dolor ornare, ut aliquam odio placerat. Aenean ultrices condimentum quam vitae pharetra.""" def test_show_diff_plain_text(self): """textutils - show_diff() with plain text""" expected_result = """ Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec fringilla tellus eget fringilla sagittis. Pellentesque -posuere lacus id erat. +posuere lacus id erat tristique pulvinar. Morbi volutpat, diam eget interdum lobortis, lacus mi cursus leo, sit amet porttitor neque est vitae lectus. Donec tempor metus vel tincidunt fringilla. -Nam iaculis lacinia nisl, consectetur viverra enim sollicitudin +Nam iaculis lacinia nisl, enim sollicitudin convallis. Morbi ut mauris velit. Proin suscipit dolor id risus placerat sodales nec id elit. Morbi vel lacinia lectus, eget laoreet -placerat sodales nec id elit. Morbi vel lacinia lectus, eget laoreet dui. Nunc commodo neque porttitor eros placerat, sed ultricies purus -accumsan. In velit nisi, lorem ipsum lorem gravida a, rutrum in augue. +accumsan. In velit nisi, accumsan molestie gravida a, rutrum in augue. Nulla pharetra purus nec dolor ornare, ut aliquam odio placerat. Aenean ultrices condimentum quam vitae pharetra. """ self.assertEqual(show_diff(self.string1, self.string2), expected_result) def test_show_diff_html(self): """textutils - show_diff() with plain text""" expected_result = """<pre> Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec fringilla tellus eget fringilla sagittis. Pellentesque <strong class="diff_field_deleted">posuere lacus id erat.</strong> <strong class="diff_field_added">posuere lacus id erat tristique pulvinar. Morbi volutpat, diam</strong> eget interdum lobortis, lacus mi cursus leo, sit amet porttitor neque est vitae lectus. Donec tempor metus vel tincidunt fringilla. <strong class="diff_field_deleted">Nam iaculis lacinia nisl, consectetur viverra enim sollicitudin</strong> <strong class="diff_field_added">Nam iaculis lacinia nisl, enim sollicitudin</strong> convallis. Morbi ut mauris velit. Proin suscipit dolor id risus placerat sodales nec id elit. Morbi vel lacinia lectus, eget laoreet <strong class="diff_field_deleted">placerat sodales nec id elit. 
Morbi vel lacinia lectus, eget laoreet</strong> dui. Nunc commodo neque porttitor eros placerat, sed ultricies purus <strong class="diff_field_deleted">accumsan. In velit nisi, lorem ipsum lorem gravida a, rutrum in augue.</strong> <strong class="diff_field_added">accumsan. In velit nisi, accumsan molestie gravida a, rutrum in augue.</strong> Nulla pharetra purus nec dolor ornare, ut aliquam odio placerat. Aenean ultrices condimentum quam vitae pharetra. </pre>""" self.assertEqual(show_diff(self.string1, self.string2, prefix="<pre>", suffix="</pre>", prefix_unchanged='', suffix_unchanged='', prefix_removed='<strong class="diff_field_deleted">', suffix_removed='</strong>', prefix_added='<strong class="diff_field_added">', suffix_added='</strong>'), expected_result) class TestALALC(InvenioTestCase): """Test for handling ALA-LC transliteration.""" - if UNIDECODE_AVAILABLE: - def test_alalc(self): - msg = "眾鳥高飛盡" - encoded_text, encoding = guess_minimum_encoding(msg) - unicode_text = unicode(encoded_text.decode(encoding)) - self.assertEqual("Zhong Niao Gao Fei Jin ", - transliterate_ala_lc(unicode_text)) + def test_alalc(self): + msg = "眾鳥高飛盡" + encoded_text, encoding = guess_minimum_encoding(msg) + unicode_text = unicode(encoded_text.decode(encoding)) + self.assertEqual("Zhong Niao Gao Fei Jin ", + transliterate_ala_lc(unicode_text)) class LatexEscape(InvenioTestCase): """Test for escape latex function""" def test_escape_latex(self): unescaped = "this is unescaped latex & % $ # _ { } ~ \ ^ and some multi-byte chars: żółw mémêmëmè" escaped = escape_latex(unescaped) self.assertEqual(escaped, "this is unescaped latex \\& \\% \\$ \\# \\_ \\{ \\} \\~{} \\textbackslash{} \\^{} and some multi-byte chars: \xc5\xbc\xc3\xb3\xc5\x82w m\xc3\xa9m\xc3\xaam\xc3\xabm\xc3\xa8") TEST_SUITE = make_test_suite(WrapTextInABoxTest, GuessMinimumEncodingTest, WashForXMLTest, WashForUTF8Test, DecodeToUnicodeTest, Latex2UnicodeTest, TestStripping, TestALALC, TestDiffering) if __name__ == "__main__": run_test_suite(TEST_SUITE)