diff --git a/INSTALL b/INSTALL
index 1b6700bd1..b4a3c5eea 100644
--- a/INSTALL
+++ b/INSTALL
@@ -1,842 +1,845 @@
Invenio INSTALLATION
====================
About
=====
This document specifies how to build, customize, and install Invenio
v1.1.2 for the first time. See RELEASE-NOTES if you are upgrading
from a previous Invenio release.
Contents
========
0. Prerequisites
1. Quick instructions for the impatient Invenio admin
2. Detailed instructions for the patient Invenio admin
0. Prerequisites
================
Here is the software you need to have around before you
start installing Invenio:
a) Unix-like operating system. The main development and
production platforms for Invenio at CERN are GNU/Linux
distributions Debian, Gentoo, Scientific Linux (aka RHEL),
Ubuntu, but we also develop on Mac OS X. Basically any Unix
system supporting the software listed below should do.
If you are using Debian GNU/Linux ``Lenny'' or later, then you
can install most of the below-mentioned prerequisites and
recommendations by running:
$ sudo aptitude install python-dev apache2-mpm-prefork \
mysql-server mysql-client python-mysqldb \
python-4suite-xml python-simplejson python-xml \
python-libxml2 python-libxslt1 gnuplot poppler-utils \
gs-common clisp gettext libapache2-mod-wsgi unzip \
python-dateutil python-rdflib python-pyparsing \
python-gnuplot python-magic pdftk html2text giflib-tools \
pstotext netpbm python-pypdf python-chardet python-lxml \
python-unidecode
You may also want to install some of the following packages,
if you have them available on your concrete architecture:
$ sudo aptitude install sbcl cmucl pylint pychecker pyflakes \
python-profiler python-epydoc libapache2-mod-xsendfile \
openoffice.org python-utidylib python-beautifulsoup
(Note that if you use pip to manage your Python dependencies
instead of operating system packages, please see the section
(d) below on how to use pip instead of aptitude.)
Moreover, you should install some Message Transfer Agent (MTA)
such as Postfix so that Invenio can email notification
alerts or registration information to the end users, contact
moderators and reviewers of submitted documents, inform
administrators about various runtime system information, etc:
$ sudo aptitude install postfix
After running the above-quoted aptitude command(s), you can
proceed to configuring your MySQL server instance
(max_allowed_packet in my.cnf, see item 0b below) and then to
installing the Invenio software package in the section 1
below.
If you are using another operating system, then please
continue reading the rest of this prerequisites section, and
please consult our wiki pages for any concrete hints for your
specific operating system.
<https://twiki.cern.ch/twiki/bin/view/CDS/Invenio>
b) MySQL server (may be on a remote machine), and MySQL client
(must be available locally too). MySQL versions 4.1 or 5.0
are supported. Please set the variable "max_allowed_packet"
in your "my.cnf" init file to at least 4M. (For sites such as
INSPIRE, which has 1M records with 10M citer-citee pairs in its
citation map, you may need to increase max_allowed_packet to
1G.) You may perhaps also want to run your MySQL server
natively in UTF-8 mode by setting "default-character-set=utf8"
in various parts of your "my.cnf" file, such as in the
"[mysql]" part and elsewhere; but this is not really required.
<http://mysql.com/>
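For example, a minimal sketch of the relevant "my.cnf" lines
(values are illustrative; tune them to your site's size):
[mysqld]
max_allowed_packet=4M
default-character-set=utf8
[mysql]
default-character-set=utf8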
c) Apache 2 server, with support for loading DSO modules, and
optionally with SSL support for HTTPS-secure user
authentication, and mod_xsendfile for off-loading file
downloads away from Invenio processes to Apache.
<http://httpd.apache.org/>
<http://tn123.ath.cx/mod_xsendfile/>
d) Python v2.6 or above:
<http://python.org/>
as well as the following Python modules:
- (mandatory) MySQLdb (version >= 1.2.1_p2; see below)
<http://sourceforge.net/projects/mysql-python>
- (mandatory) Pyparsing, for document parsing
<http://pyparsing.wikispaces.com/>
+ - (mandatory) ElementTree; required for Python-2.4 only,
+ as it is already built into Python-2.6:
+ <http://effbot.org/zone/element-index.htm>
- (recommended) python-dateutil, for complex date processing:
<http://labix.org/python-dateutil>
- (recommended) PyXML, for XML processing:
<http://pyxml.sourceforge.net/topics/download.html>
- (recommended) PyRXP, for very fast XML MARC processing:
<http://www.reportlab.org/pyrxp.html>
- (recommended) lxml, for XML/XSLT processing:
<http://lxml.de/>
- (recommended) libxml2-python, for XML/XSLT processing:
<ftp://xmlsoft.org/libxml2/python/>
- (recommended) Gnuplot.Py, for producing graphs:
<http://gnuplot-py.sourceforge.net/>
- (recommended) Snowball Stemmer, for stemming:
<http://snowball.tartarus.org/wrappers/PyStemmer-1.0.1.tar.gz>
- (recommended) py-editdist, for record merging:
<http://www.mindrot.org/projects/py-editdist/>
- (recommended) numpy, for citerank methods:
<http://numpy.scipy.org/>
- (recommended) magic, for full-text file handling:
<http://www.darwinsys.com/file/>
- (optional) chardet, for character encoding detection:
<http://chardet.feedparser.org/>
- (optional) 4suite, slower alternative to PyRXP and
libxml2-python:
<http://4suite.org/>
- (optional) feedparser, for web journal creation:
<http://feedparser.org/>
- (optional) RDFLib, to use RDF ontologies and thesauri:
<http://rdflib.net/>
- (optional) mechanize, to run regression web test suite:
<http://wwwsearch.sourceforge.net/mechanize/>
- (optional) python-mock, mocking library for the test suite:
<http://www.voidspace.org.uk/python/mock/>
- (optional) utidylib, for HTML washing:
<http://utidylib.berlios.de/>
- (optional) Beautiful Soup, for HTML washing:
<http://www.crummy.com/software/BeautifulSoup/>
- (optional) Python Twitter (and its dependencies) if you want
to use the Twitter Fetcher bibtasklet:
<http://code.google.com/p/python-twitter/>
- (optional) Python OpenID if you want to enable OpenID support
for authentication:
<http://pypi.python.org/pypi/python-openid/>
- (optional) Python Rauth if you want to enable OAuth 1.0/2.0
support for authentication (depends on Python-2.6 or later):
<http://packages.python.org/rauth/>
- (optional) unidecode, for ASCII representation of Unicode
text:
<https://pypi.python.org/pypi/Unidecode>
Note that if you are using pip to install and manage your
Python dependencies, then you can run:
$ sudo pip install -r requirements.txt
$ sudo pip install -r requirements-extras.txt
to install all mandatory, recommended, and optional packages
mentioned above.
e) mod_wsgi Apache module. Versions 3.x and above are
recommended.
<http://code.google.com/p/modwsgi/>
f) If you want to be able to extract references from PDF fulltext
files, then you need to install at least pdftotext version 3.
<http://poppler.freedesktop.org/>
<http://www.foolabs.com/xpdf/home.html>
g) If you want to be able to search for words in the fulltext
files (i.e. to have fulltext indexing) or to stamp submitted
files, then you need as well to install some of the following
tools:
- for Microsoft Office/OpenOffice.org document conversion:
OpenOffice.org
<http://www.openoffice.org/>
- for PDF file stamping: pdftk, pdf2ps
<http://www.accesspdf.com/pdftk/>
<http://www.cs.wisc.edu/~ghost/doc/AFPL/>
- for PDF files: pdftotext or pstotext
<http://poppler.freedesktop.org/>
<http://www.foolabs.com/xpdf/home.html>
<http://www.cs.wisc.edu/~ghost/doc/AFPL/>
- for PostScript files: pstotext or ps2ascii
<http://www.cs.wisc.edu/~ghost/doc/AFPL/>
- for DjVu creation, elaboration: DjVuLibre
<http://djvu.sourceforge.net>
- to perform OCR: OCRopus (tested only with release 0.3.1)
<http://code.google.com/p/ocropus/>
- to perform different image elaborations: ImageMagick
<http://www.imagemagick.org/>
- to generate PDF after OCR: netpbm, ReportLab and pyPdf or pyPdf2
<http://netpbm.sourceforge.net/>
<http://www.reportlab.org/rl_toolkit.html>
<http://pybrary.net/pyPdf/>
<http://knowah.github.io/PyPDF2/>
h) If you have chosen to install fast XML MARC Python processors
in the step d) above, then you have to install the parsers
themselves:
- (optional) 4suite:
<http://4suite.org/>
i) (recommended) Gnuplot, the command-line driven interactive
plotting program. It is used to display download and citation
history graphs on the Detailed record pages on the web
interface. Note that Gnuplot must be compiled with PNG output
support, that is, with the GD library. Note also that Gnuplot
is not required, only recommended.
<http://www.gnuplot.info/>
j) (recommended) A Common Lisp implementation, such as CLISP,
SBCL or CMUCL. It is used for the web server log analysing
tool and the metadata checking program. Note that any of the
three implementations CLISP, SBCL, or CMUCL will do. CMUCL
produces the fastest machine code, but it does not support UTF-8
yet. Pick CLISP if you don't know which to choose. Note that a
Common Lisp implementation is not required, only recommended.
<http://clisp.cons.org/>
<http://www.cons.org/cmucl/>
<http://sbcl.sourceforge.net/>
k) GNU gettext, a set of tools that makes it possible to
translate the application into multiple languages.
<http://www.gnu.org/software/gettext/>
This is available by default on many systems.
l) (recommended) xlwt 0.7.2, a library to create spreadsheet files
compatible with MS Excel 97/2000/XP/2003 XLS files, on any
platform, with Python 2.3 to 2.6:
<http://pypi.python.org/pypi/xlwt>
m) (recommended) matplotlib 1.0.0, a Python 2D plotting library
which produces publication-quality figures in a variety of
hardcopy formats and interactive environments across
platforms. matplotlib can be used in Python scripts, the
Python and IPython shells (a la MATLAB® or Mathematica®),
web application servers, and six graphical user interface
toolkits. It is used to generate pie graphs in the custom
summary query (WebStat).
<http://matplotlib.sourceforge.net>
n) (optional) FFmpeg, an open-source collection of tools and
libraries to convert video and audio files. It makes use of both
internal and external libraries to generate web-ready videos,
such as Theora, WebM and H.264, out of almost any thinkable
video input. FFmpeg is needed to run video-related modules and
submission workflows in Invenio. The minimal configuration of
FFmpeg for the Invenio demo site requires a number of external
libraries. It is highly recommended to remove all installed
versions and packages that come with various Linux distributions
and to install the latest versions from sources. Additionally,
you will need the MediaInfo library for multimedia metadata
handling.
Minimum libraries for the demo site:
- the ffmpeg multimedia encoder tools
<http://ffmpeg.org/>
- a library for jpeg images needed for thumbnail extraction
<http://www.openjpeg.org/>
- a library for the Ogg container format, needed for Vorbis and Theora
<http://www.xiph.org/ogg/>
- the Ogg Vorbis audio codec library
<http://www.vorbis.com/>
- the Ogg Theora video codec library
<http://www.theora.org/>
- the WebM video codec library
<http://www.webmproject.org/>
- the mediainfo library for multimedia metadata
<http://mediainfo.sourceforge.net/>
Recommended for H.264 video (!be aware of licensing issues!):
- a library for H.264 video encoding
<http://www.videolan.org/developers/x264.html>
- a library for Advanced Audio Coding
<http://www.audiocoding.com/faac.html>
- a library for MP3 encoding
<http://lame.sourceforge.net/>
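For illustration, after installing the above libraries from
sources, an FFmpeg build configured for the demo site could look
like this (flags are indicative; check `./configure --help' of
your FFmpeg version):
$ ./configure --enable-gpl --enable-nonfree --enable-libtheora \
--enable-libvorbis --enable-libvpx --enable-libx264 \
--enable-libfaac --enable-libmp3lame --enable-libopenjpeg
$ make
$ sudo make install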
Note that the configure script checks whether you have all the
prerequisite software installed and that it won't let you continue
unless everything is in order. It also warns you if it cannot find
some optional but recommended software.
1. Quick instructions for the impatient Invenio admin
=========================================================
1a. Installation
----------------
$ cd $HOME/src/
$ wget http://invenio-software.org/download/invenio-1.1.2.tar.gz
$ wget http://invenio-software.org/download/invenio-1.1.2.tar.gz.md5
$ wget http://invenio-software.org/download/invenio-1.1.2.tar.gz.sig
$ md5sum -c invenio-1.1.2.tar.gz.md5
$ gpg --verify invenio-1.1.2.tar.gz.sig invenio-1.1.2.tar.gz
$ tar xvfz invenio-1.1.2.tar.gz
$ cd invenio-1.1.2
$ ./configure
$ make
$ make install
$ make install-mathjax-plugin ## optional
$ make install-jquery-plugins ## optional
$ make install-ckeditor-plugin ## optional
$ make install-pdfa-helper-files ## optional
$ make install-mediaelement ## optional
$ make install-solrutils ## optional
$ make install-js-test-driver ## optional
1b. Configuration
-----------------
$ sudo chown -R www-data.www-data /opt/invenio
$ sudo -u www-data emacs /opt/invenio/etc/invenio-local.conf
$ sudo -u www-data /opt/invenio/bin/inveniocfg --update-all
$ sudo -u www-data /opt/invenio/bin/inveniocfg --create-tables
$ sudo -u www-data /opt/invenio/bin/inveniocfg --load-bibfield-conf
$ sudo -u www-data /opt/invenio/bin/inveniocfg --load-webstat-conf
$ sudo -u www-data /opt/invenio/bin/inveniocfg --create-apache-conf
$ sudo /etc/init.d/apache2 restart
$ sudo -u www-data /opt/invenio/bin/inveniocfg --check-openoffice
$ sudo -u www-data /opt/invenio/bin/inveniocfg --create-demo-site
$ sudo -u www-data /opt/invenio/bin/inveniocfg --load-demo-records
$ sudo -u www-data /opt/invenio/bin/inveniocfg --run-unit-tests
$ sudo -u www-data /opt/invenio/bin/inveniocfg --run-regression-tests
$ sudo -u www-data /opt/invenio/bin/inveniocfg --run-web-tests
$ sudo -u www-data /opt/invenio/bin/inveniocfg --remove-demo-records
$ sudo -u www-data /opt/invenio/bin/inveniocfg --drop-demo-site
$ firefox http://your.site.com/help/admin/howto-run
2. Detailed instructions for the patient Invenio admin
==========================================================
2a. Installation
----------------
Invenio uses the standard GNU autoconf method to build and
install its files. This means that you proceed as follows:
$ cd $HOME/src/
Change to a directory where we will build the Invenio
sources. (The built files will be installed into different
"target" directories later.)
$ wget http://invenio-software.org/download/invenio-1.1.2.tar.gz
$ wget http://invenio-software.org/download/invenio-1.1.2.tar.gz.md5
$ wget http://invenio-software.org/download/invenio-1.1.2.tar.gz.sig
Fetch Invenio source tarball from the distribution server,
together with MD5 checksum and GnuPG cryptographic signature
files useful for verifying the integrity of the tarball.
$ md5sum -c invenio-1.1.2.tar.gz.md5
Verify MD5 checksum.
$ gpg --verify invenio-1.1.2.tar.gz.sig invenio-1.1.2.tar.gz
Verify GnuPG cryptographic signature. Note that you may
first have to import my public key into your keyring, if you
haven't done that already:
$ gpg --keyserver wwwkeys.eu.pgp.net --recv-keys 0xBA5A2B67
The output of the gpg --verify command should then read:
Good signature from "Tibor Simko <tibor@simko.info>"
You can safely ignore any trusted signature certification
warning that may follow after the signature has been
successfully verified.
$ tar xvfz invenio-1.1.2.tar.gz
Untar the distribution tarball.
$ cd invenio-1.1.2
Go to the source directory.
$ ./configure
Configure Invenio software for building on this specific
platform. You can use the following optional parameters:
--prefix=/opt/invenio
Optionally, specify the Invenio general
installation directory (default is /opt/invenio).
It will contain command-line binaries and program
libraries containing the core Invenio
functionality, but also store web pages, runtime log
and cache information, document data files, etc.
Several subdirs like `bin', `etc', `lib', or `var'
will be created inside the prefix directory to this
effect. Note that the prefix directory should be
chosen outside of the Apache htdocs tree, since only
one of its subdirectories (prefix/var/www) is to be
accessible directly via the Web (see below).
Note that Invenio won't install to any other
directory but to the prefix mentioned in this
configuration line.
--with-python=/opt/python/bin/python2.7
Optionally, specify a path to some specific Python
binary. This is useful if you have more than one
Python installation on your system. If you don't set
this option, then the first Python that will be found
in your PATH will be chosen for running Invenio.
--with-mysql=/opt/mysql/bin/mysql
Optionally, specify a path to some specific MySQL
client binary. This is useful if you have more than
one MySQL installation on your system. If you don't
set this option, then the first MySQL client
executable that will be found in your PATH will be
chosen for running Invenio.
--with-clisp=/opt/clisp/bin/clisp
Optionally, specify a path to CLISP executable. This
is useful if you have more than one CLISP
installation on your system. If you don't set this
option, then the first executable that will be found
in your PATH will be chosen for running Invenio.
--with-cmucl=/opt/cmucl/bin/lisp
Optionally, specify a path to CMUCL executable. This
is useful if you have more than one CMUCL
installation on your system. If you don't set this
option, then the first executable that will be found
in your PATH will be chosen for running Invenio.
--with-sbcl=/opt/sbcl/bin/sbcl
Optionally, specify a path to SBCL executable. This
is useful if you have more than one SBCL
installation on your system. If you don't set this
option, then the first executable that will be found
in your PATH will be chosen for running Invenio.
--with-openoffice-python
Optionally, specify the path to the Python interpreter
embedded with OpenOffice.org. This is normally not
contained in the standard PATH. If you don't specify it,
it won't be possible to use OpenOffice.org to convert from and
to Microsoft Office and OpenOffice.org documents.
This configuration step is mandatory. Usually, you do this
step only once.
(Note that if you are building Invenio not from a
released tarball, but from the Git sources, then you have to
generate the configure file via autotools:
$ sudo aptitude install automake1.9 autoconf
$ aclocal-1.9
$ automake-1.9 -a
$ autoconf
after which you proceed with the usual configure command.)
$ make
Launch the Invenio build. Since many messages are printed
during the build process, you may want to run it in a
fast-scrolling terminal such as rxvt or in a detached screen
session.
During this step all the pages and scripts will be
pre-created and customized based on the config you have
edited in the previous step.
Note that on systems such as FreeBSD or Mac OS X you have to
use GNU make ("gmake") instead of "make".
$ make install
Install the web pages, scripts, utilities and everything
needed for Invenio runtime into respective installation
directories, as specified earlier by the configure command.
Note that if you are installing Invenio for the first
time, you will be asked to create symbolic link(s) from
Python's site-packages system-wide directory(ies) to the
installation location. This is in order to instruct Python
where to find Invenio's Python files. You will be
hinted as to the exact command to use based on the
parameters you have used in the configure command.
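For illustration, such a hint typically boils down to a symlink
command of the following kind (exact paths depend on your Python
version and configure prefix):
$ sudo ln -s /opt/invenio/lib/python/invenio \
/usr/lib/python2.6/site-packages/invenio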
$ make install-mathjax-plugin ## optional
This will automatically download and install in the proper
place MathJax, a JavaScript library to render LaTeX formulas
in the client browser.
Note that in order to enable the rendering you will have to
set the variable CFG_WEBSEARCH_USE_MATHJAX_FOR_FORMATS in
invenio-local.conf to a suitable list of output format
codes. For example:
CFG_WEBSEARCH_USE_MATHJAX_FOR_FORMATS = hd,hb
$ make install-jquery-plugins ## optional
This will automatically download and install in the proper
place jQuery and related plugins. They are used for AJAX
applications such as the record editor.
Note that `unzip' is needed when installing jquery plugins.
$ make install-ckeditor-plugin ## optional
This will automatically download and install in the proper
place CKEditor, a WYSIWYG JavaScript-based editor (e.g. for
the WebComment module).
Note that in order to enable the editor you have to set the
CFG_WEBCOMMENT_USE_RICH_EDITOR variable to True.
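For example, in your invenio-local.conf:
CFG_WEBCOMMENT_USE_RICH_EDITOR = True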
$ make install-pdfa-helper-files ## optional
This will automatically download and install in the proper
place the helper files needed to create PDF/A files out of
existing PDF files.
$ make install-mediaelement ## optional
This will automatically download and install the MediaElementJS
HTML5 video player that is needed for videos on the DEMO site.
$ make install-solrutils ## optional
This will automatically download and install a Solr instance
which can be used for full-text searching. See the CFG_SOLR_URL
variable in invenio.conf. Note that the admin later has
to take care of running init.d scripts which would start the
Solr instance automatically.
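For example, assuming Solr listens on its default port on the
same host, you could set in invenio-local.conf:
CFG_SOLR_URL = http://localhost:8983/solr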
$ make install-js-test-driver ## optional
This will automatically download and install JsTestDriver
which is needed to run JS unit tests. Recommended for developers.
2b. Configuration
-----------------
Once the basic software installation is done, we proceed to
configuring your Invenio system.
$ sudo chown -R www-data.www-data /opt/invenio
For the sake of simplicity, let us assume that your Invenio
installation will run under the `www-data' user process
identity. The above command changes ownership of installed
files to www-data, so that we shall run everything under
this user identity from now on.
For production purposes, you would typically enable Apache
server to read all files from the installation place but to
write only to the `var' subdirectory of your installation
place. You could achieve this by configuring Unix directory
group permissions, for example.
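For illustration, one such group-permission setup could look
like this (sketch only; adapt users and groups to your site):
$ sudo chown -R root.www-data /opt/invenio
$ sudo chmod -R g+rX,o-rwx /opt/invenio
$ sudo chown -R www-data.www-data /opt/invenio/var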
$ sudo -u www-data emacs /opt/invenio/etc/invenio-local.conf
Customize your Invenio installation. Please read the
'invenio.conf' file located in the same directory; it
contains the vanilla default configuration parameters of
your Invenio installation. If you want to customize some of
these parameters, you should create a file named
'invenio-local.conf' in the same directory where
'invenio.conf' lives and you should write there only the
customizations that you want to be different from the
vanilla defaults.
Here is a realistic, minimalist, yet production-ready
example of what you would typically put there:
$ cat /opt/invenio/etc/invenio-local.conf
[Invenio]
CFG_SITE_NAME = John Doe's Document Server
CFG_SITE_NAME_INTL_fr = Serveur des Documents de John Doe
CFG_SITE_URL = http://your.site.com
CFG_SITE_SECURE_URL = https://your.site.com
CFG_SITE_ADMIN_EMAIL = john.doe@your.site.com
CFG_SITE_SUPPORT_EMAIL = john.doe@your.site.com
CFG_WEBALERT_ALERT_ENGINE_EMAIL = john.doe@your.site.com
CFG_WEBCOMMENT_ALERT_ENGINE_EMAIL = john.doe@your.site.com
CFG_WEBCOMMENT_DEFAULT_MODERATOR = john.doe@your.site.com
CFG_BIBAUTHORID_AUTHOR_TICKET_ADMIN_EMAIL = john.doe@your.site.com
CFG_BIBCATALOG_SYSTEM_EMAIL_ADDRESS = john.doe@your.site.com
CFG_DATABASE_HOST = localhost
CFG_DATABASE_NAME = invenio
CFG_DATABASE_USER = invenio
CFG_DATABASE_PASS = my123p$ss
CFG_BIBDOCFILE_ENABLE_BIBDOCFSINFO_CACHE = 1
You should override at least the parameters mentioned above
in order to define some very essential runtime parameters
such as the name of your document server (CFG_SITE_NAME and
CFG_SITE_NAME_INTL_*), the visible URL of your document
server (CFG_SITE_URL and CFG_SITE_SECURE_URL), the email
address of the local Invenio administrator, comment
moderator, and alert engine (CFG_SITE_SUPPORT_EMAIL,
CFG_SITE_ADMIN_EMAIL, etc), and last but not least your
database credentials (CFG_DATABASE_*).
If this is a first installation of Invenio, it is recommended
that you set the CFG_BIBDOCFILE_ENABLE_BIBDOCFSINFO_CACHE
variable to 1. If this is instead an upgrade of an existing
installation, don't add it until you have run:
$ bibdocfile --fix-bibdocfsinfo-cache .
The Invenio system will then read both the default
invenio.conf file and your customized invenio-local.conf
file and it will override any default options with the ones
you have specified in your local file. This cascading of
configuration parameters will ease your future upgrades.
If you want to have multiple Invenio instances for distributed
video encoding, you need to share the same configuration among
them and make some of the folders of the Invenio installation
available for all nodes.
Configure the allowed tasks for every node:
CFG_BIBSCHED_NODE_TASKS = {
"hostname_machine1" : ["bibindex", "bibupload",
"bibreformat","webcoll", "bibtaskex", "bibrank",
"oaiharvest", "oairepositoryupdater", "inveniogc",
"webstatadmin", "bibclassify", "bibexport",
"dbdump", "batchuploader", "bibauthorid", "bibtasklet"],
"hostname_machine2" : ['bibencode',]
}
Share the following directories among Invenio instances:
/var/tmp-shared
hosts video uploads in a temporary form
/var/tmp-shared/bibencode/jobs
hosts new job files for the video encoding daemon
/var/tmp-shared/bibencode/jobs/done
hosts job files that have been processed by the daemon
/var/data/files
hosts fulltext and media files associated with records
/var/data/submit
hosts files created during submissions
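For illustration, a worker node could mount these directories
from the master over NFS with /etc/fstab entries of the
following kind (hostname and export paths are illustrative):
master.site.com:/opt/invenio/var/tmp-shared /opt/invenio/var/tmp-shared nfs defaults 0 0
master.site.com:/opt/invenio/var/data/files /opt/invenio/var/data/files nfs defaults 0 0
master.site.com:/opt/invenio/var/data/submit /opt/invenio/var/data/submit nfs defaults 0 0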
$ sudo -u www-data /opt/invenio/bin/inveniocfg --update-all
Make the rest of the Invenio system aware of your
invenio-local.conf changes. This step is mandatory each
time you edit your conf files.
$ sudo -u www-data /opt/invenio/bin/inveniocfg --create-tables
If you are installing Invenio for the first time, you
have to create database tables.
Note that this step checks for potential problems such as
the database connection rights and may ask you to perform
some more administrative steps in case it detects a problem.
Notably, it may ask you to set up database access
permissions, based on your configure values.
If you are installing Invenio for the first time, you
have to create a dedicated database on your MySQL server
that Invenio can use for its purposes. Please
contact your MySQL administrator and ask them to execute the
commands this step proposes.
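For illustration, the proposed commands are typically of the
following kind, assuming the database credentials from the
invenio-local.conf example above:
$ mysql -h localhost -u root -p
mysql> CREATE DATABASE invenio DEFAULT CHARACTER SET utf8;
mysql> GRANT ALL PRIVILEGES ON invenio.* TO invenio@localhost IDENTIFIED BY 'my123p$ss';
mysql> QUIT;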
At this point you should now have successfully completed the
"make install" process. We continue by setting up the
Apache web server.
$ sudo -u www-data /opt/invenio/bin/inveniocfg --load-bibfield-conf
Load the configuration file of the BibField module. It will
create the `bibfield_config.py' file. (FIXME: When BibField
becomes an essential part of Invenio, this step should be
automated so that people do not have to run it manually.)
$ sudo -u www-data /opt/invenio/bin/inveniocfg --load-webstat-conf
Load the configuration file of the WebStat module. It will
create the database tables used to register custom events,
such as basket hits.
$ sudo -u www-data /opt/invenio/bin/inveniocfg --create-apache-conf
Running this command will generate Apache virtual host
configurations matching your installation. You will be
instructed to check created files (usually they are located
under /opt/invenio/etc/apache/) and edit your httpd.conf
to activate Invenio virtual hosts.
If you are using Debian GNU/Linux ``Lenny'' or later, then
you can do the following to create your SSL certificate and
to activate your Invenio vhosts:
## make SSL certificate:
$ sudo aptitude install ssl-cert
$ sudo mkdir /etc/apache2/ssl
$ sudo /usr/sbin/make-ssl-cert /usr/share/ssl-cert/ssleay.cnf \
/etc/apache2/ssl/apache.pem
## add Invenio web sites:
$ sudo ln -s /opt/invenio/etc/apache/invenio-apache-vhost.conf \
/etc/apache2/sites-available/invenio
$ sudo ln -s /opt/invenio/etc/apache/invenio-apache-vhost-ssl.conf \
/etc/apache2/sites-available/invenio-ssl
## disable Debian's default web site:
$ sudo /usr/sbin/a2dissite default
## enable Invenio web sites:
$ sudo /usr/sbin/a2ensite invenio
$ sudo /usr/sbin/a2ensite invenio-ssl
## enable SSL module:
$ sudo /usr/sbin/a2enmod ssl
## if you are using xsendfile module, enable it too:
$ sudo /usr/sbin/a2enmod xsendfile
If you are using another operating system, you should do the
equivalent, for example edit your system-wide httpd.conf and
put the following include statements:
Include /opt/invenio/etc/apache/invenio-apache-vhost.conf
Include /opt/invenio/etc/apache/invenio-apache-vhost-ssl.conf
Note that you may need to adapt generated vhost file
snippets to match your concrete operating system specifics.
For example, the generated configuration snippet will
preload Invenio WSGI daemon application upon Apache start up
for faster site response. The generated configuration
assumes that you are using mod_wsgi version 3 or later. If
you are using the old legacy mod_wsgi version 2, then you
would need to comment out the WSGIImportScript directive
from the generated snippet, or else move the WSGI daemon
setup to the top level, outside of the VirtualHost section.
Note also that you may want to tweak the generated Apache
vhost snippet for performance reasons, especially with
respect to WSGIDaemonProcess parameters. For example, you
can increase the number of processes from the default value
`processes=5' if you have lots of RAM and if many concurrent
users may access your site in parallel. However, note that
you must use `threads=1' there, because Invenio WSGI daemon
processes are not fully thread safe yet. This may change in
the future.
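For instance, assuming the generated snippet defines a daemon
process group named `invenio', a tuned directive could look like
this (values are illustrative):
WSGIDaemonProcess invenio processes=10 threads=1 \
display-name=%{GROUP} inactivity-timeout=3600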
$ sudo /etc/init.d/apache2 restart
Please ask your webserver administrator to restart the
Apache server after the above "httpd.conf" changes.
$ sudo -u www-data /opt/invenio/bin/inveniocfg --check-openoffice
If you plan to support MS Office or Open Document Format
files in your installation, you should check whether
LibreOffice or OpenOffice.org is well integrated with
Invenio by running the above command. You may be asked to
create a temporary directory for converting office files
with special ownership (typically as user nobody) and
permissions. Note that you can do this step later.
$ sudo -u www-data /opt/invenio/bin/inveniocfg --create-demo-site
This step is recommended to test your local Invenio
installation. It should give you our "Atlantis Institute of
Science" demo installation, exactly as you see it at
<http://invenio-demo.cern.ch/>.
$ sudo -u www-data /opt/invenio/bin/inveniocfg --load-demo-records
Optionally, load some demo records to be able to test
indexing and searching of your local Invenio demo
installation.
$ sudo -u www-data /opt/invenio/bin/inveniocfg --run-unit-tests
Optionally, you can run the unit test suite to verify the
unit behaviour of your local Invenio installation. Note
that this command should be run only after you have
installed the whole system via `make install'.
$ sudo -u www-data /opt/invenio/bin/inveniocfg --run-regression-tests
Optionally, you can run the full regression test suite to
verify the functional behaviour of your local Invenio
installation. Note that this command requires you to have
created the demo site and loaded the demo records. Note
also that running the regression test suite may alter the
database content with junk data, so that rebuilding the
demo site is strongly recommended afterwards.
$ sudo -u www-data /opt/invenio/bin/inveniocfg --run-web-tests
Optionally, you can run additional automated web tests
running in a real browser. This requires having Firefox
with the Selenium IDE extension installed.
<http://en.www.mozilla.com/en/firefox/>
<http://selenium-ide.openqa.org/>
$ sudo -u www-data /opt/invenio/bin/inveniocfg --remove-demo-records
Optionally, remove the demo records loaded in the previous
step, while otherwise keeping the demo collection, submission,
format, and other configurations that you may reuse and
modify for your own production purposes.
$ sudo -u www-data /opt/invenio/bin/inveniocfg --drop-demo-site
Optionally, drop also all the demo configuration so that
you'll end up with a completely blank Invenio system.
However, you may find it more practical not to drop
the demo site configuration but to start customizing from
there.
$ firefox http://your.site.com/help/admin/howto-run
In order to start using your Invenio installation, you
can start indexing, formatting and other daemons as
indicated in the "HOWTO Run" guide on the above URL. You
can also use the Admin Area web interfaces to perform
further runtime configurations such as the definition of
data collections, document types, document formats, word
indexes, etc.
$ sudo ln -s /opt/invenio/etc/bash_completion.d/inveniocfg \
/etc/bash_completion.d/inveniocfg
Optionally, if you are using Bash shell completion, then
you may want to create the above symlink in order to
configure completion for the inveniocfg command.
Good luck, and thanks for choosing Invenio.
- Invenio Development Team
<info@invenio-software.org>
<http://invenio-software.org/>
diff --git a/configure.ac b/configure.ac
index 31e6f8637..26b00ce18 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1,948 +1,949 @@
## This file is part of Invenio.
## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
## This is Invenio main configure.ac file. If you change this
## file, then please run "autoreconf" to regenerate the "configure"
## script.
## Initialize autoconf and automake:
AC_INIT([invenio],
m4_esyscmd([./git-version-gen .tarball-version]),
[info@invenio-software.org])
AM_INIT_AUTOMAKE([tar-ustar])
## By default we shall install into /opt/invenio. (Do not use
## AC_PREFIX_DEFAULT for this, because it would not work well with
## the localstatedir hack below.)
test "${prefix}" = NONE && prefix=/opt/invenio
## Remove eventual trailing slashes from the prefix value:
test "${prefix%/}" != "" && prefix=${prefix%/}
## Check for install:
AC_PROG_INSTALL
## Check for gettext support:
AM_GNU_GETTEXT(external)
AM_GNU_GETTEXT_VERSION(0.14.4)
## Check for MySQL client:
AC_MSG_CHECKING(for mysql)
AC_ARG_WITH(mysql, AC_HELP_STRING([--with-mysql], [path to a specific MySQL binary (optional)]), MYSQL=${withval})
if test -n "$MYSQL"; then
AC_MSG_RESULT($MYSQL)
else
AC_PATH_PROG(MYSQL, mysql)
if test -z "$MYSQL"; then
AC_MSG_ERROR([
MySQL command-line client was not found in your PATH.
Please install it first.
Available from <http://mysql.com/>.])
fi
fi
## Check for Python:
AC_MSG_CHECKING(for python)
AC_ARG_WITH(python, AC_HELP_STRING([--with-python], [path to a specific Python binary (optional)]), PYTHON=${withval})
if test -n "$PYTHON"; then
AC_MSG_RESULT($PYTHON)
else
AC_PATH_PROG(PYTHON, python)
if test -z "$PYTHON"; then
AC_MSG_ERROR([
Python was not found in your PATH. Please either install it
in your PATH or specify --with-python configure option.
Python is available from <http://python.org/>.])
fi
fi
## Check for OpenOffice.org Python binary:
AC_MSG_CHECKING(for OpenOffice.org Python binary)
AC_ARG_WITH(openoffice-python, AC_HELP_STRING([--with-openoffice-python], [path to a specific OpenOffice.org Python binary (optional)]), OPENOFFICE_PYTHON=`which ${withval}`)
if test -z "$OPENOFFICE_PYTHON"; then
OPENOFFICE_PYTHON=`locate -l 1 -r "o.*office/program/python$"`
OPENOFFICE_PYTHON="$PYTHON $OPENOFFICE_PYTHON"
if test -n "$OPENOFFICE_PYTHON" && ($OPENOFFICE_PYTHON -c "import uno" 2> /dev/null); then
AC_MSG_RESULT($OPENOFFICE_PYTHON)
else
AC_MSG_WARN([
You have not specified the path to the OpenOffice.org Python binary.
OpenOffice.org and Microsoft Office document conversion and fulltext indexing
will not be available. We recommend you to install OpenOffice.org first
and to rerun the configure script. OpenOffice.org is available from
<http://www.openoffice.org/>.])
fi
elif ($OPENOFFICE_PYTHON -c "import uno" 2> /dev/null); then
AC_MSG_RESULT($OPENOFFICE_PYTHON)
else
AC_MSG_ERROR([
The specified OpenOffice.org Python binary is not correctly configured.
Please specify the correct path to the specific OpenOffice Python binary
(OpenOffice.org is available from <http://www.openoffice.org/>).])
fi
## Check for Python version and modules:
AC_MSG_CHECKING(for required Python modules)
$PYTHON ${srcdir}/configure-tests.py
if test $? -ne 0; then
AC_MSG_ERROR([Please fix the above Python problem before continuing.])
fi
AC_MSG_RESULT(found)
## Check for PHP:
AC_PATH_PROG(PHP, php)
## Check for gzip:
AC_PATH_PROG(GZIP, gzip)
if test -z "$GZIP"; then
AC_MSG_WARN([
Gzip was not found in your PATH. It is used in
the WebSubmit module to compress the data submitted in an archive.
You can continue without it but you will miss some Invenio
functionality. We recommend you to install it first and to rerun
the configure script. Gzip is available from
<http://www.gzip.org/>.])
fi
## Check for gunzip:
AC_PATH_PROG(GUNZIP, gunzip)
if test -z "$GUNZIP"; then
AC_MSG_WARN([
Gunzip was not found in your PATH. It is used in
the WebSubmit module to correctly deal with submitted compressed
files.
You can continue without it but you will miss some Invenio
functionality. We recommend you to install it first and to rerun
the configure script. Gunzip is available from
<http://www.gzip.org/>.])
fi
## Check for tar:
AC_PATH_PROG(TAR, tar)
if test -z "$TAR"; then
AC_MSG_WARN([
Tar was not found in your PATH. It is used in
the WebSubmit module to pack the submitted data into an archive.
You can continue without it but you will miss some Invenio
functionality. We recommend you to install it first and to rerun
the configure script. Tar is available from
<ftp://prep.ai.mit.edu/pub/gnu/tar/>.])
fi
## Check for wget:
AC_PATH_PROG(WGET, wget)
if test -z "$WGET"; then
AC_MSG_WARN([
wget was not found in your PATH. It is used for the fulltext file
retrieval.
You can continue without it but we recommend you to install it first
and to rerun the configure script. wget is available from
<http://www.gnu.org/software/wget/>.])
fi
## Check for md5sum:
AC_PATH_PROG(MD5SUM, md5sum)
if test -z "$MD5SUM"; then
AC_MSG_WARN([
md5sum was not found in your PATH. It is used for the fulltext file
checksum verification.
You can continue without it but we recommend you to install it first
and to rerun the configure script. md5sum is available from
<http://www.gnu.org/software/coreutils/>.])
fi
## Check for ps2pdf:
AC_PATH_PROG(PS2PDF, ps2pdf)
if test -z "$PS2PDF"; then
AC_MSG_WARN([
ps2pdf was not found in your PATH. It is used in
the WebSubmit module to convert submitted PostScripts into PDF.
You can continue without it but you will miss some Invenio
functionality. We recommend you to install it first and to rerun
the configure script. ps2pdf is available from
<http://www.cs.wisc.edu/~ghost/doc/AFPL/>.])
fi
## Check for pdflatex:
AC_PATH_PROG(PDFLATEX, pdflatex)
if test -z "$PDFLATEX"; then
AC_MSG_WARN([
pdflatex was not found in your PATH. It is used in
the WebSubmit module to stamp PDF files.
You can continue without it but you will miss some Invenio
functionality. We recommend you to install it first and to rerun
the configure script.])
fi
## Check for tiff2pdf:
AC_PATH_PROG(TIFF2PDF, tiff2pdf)
if test -z "$TIFF2PDF"; then
AC_MSG_WARN([
tiff2pdf was not found in your PATH. It is used in
the WebSubmit module to convert submitted TIFF files into PDF.
You can continue without it but you will miss some Invenio
functionality. We recommend you to install it first and to rerun
the configure script. tiff2pdf is available from
<http://www.remotesensing.org/libtiff/>.])
fi
## Check for gs:
AC_PATH_PROG(GS, gs)
if test -z "$GS"; then
AC_MSG_WARN([
gs was not found in your PATH. It is used in
the WebSubmit module to convert submitted PostScripts into PDF.
You can continue without it but you will miss some Invenio
functionality. We recommend you to install it first and to rerun
the configure script. gs is available from
<http://www.cs.wisc.edu/~ghost/doc/AFPL/>.])
fi
## Check for pdftotext:
AC_PATH_PROG(PDFTOTEXT, pdftotext)
if test -z "$PDFTOTEXT"; then
AC_MSG_WARN([
pdftotext was not found in your PATH. It is used for the fulltext indexation
of PDF files.
You can continue without it but you may miss fulltext searching capability
of Invenio. We recommend you to install it first and to rerun the configure
script. pdftotext is available from <http://www.foolabs.com/xpdf/home.html>.
])
fi
## Check for pdfinfo:
AC_PATH_PROG(PDFINFO, pdfinfo)
if test -z "$PDFINFO"; then
AC_MSG_WARN([
pdfinfo was not found in your PATH. It is used for gathering information on
PDF files.
You can continue without it but you may miss this feature of Invenio.
We recommend you to install it first and to rerun the configure
script. pdfinfo is available from <http://www.foolabs.com/xpdf/home.html>.
])
fi
## Check for pdftk:
AC_PATH_PROG(PDFTK, pdftk)
if test -z "$PDFTK"; then
AC_MSG_WARN([
pdftk was not found in your PATH. It is used for the fulltext file stamping.
You can continue without it but you may miss this feature of Invenio.
We recommend you to install it first and to rerun the configure
script. pdftk is available from <http://www.accesspdf.com/pdftk/>.
])
fi
## Check for pdf2ps:
AC_PATH_PROG(PDF2PS, pdf2ps)
if test -z "$PDF2PS"; then
AC_MSG_WARN([
pdf2ps was not found in your PATH. It is used in
the WebSubmit module to convert submitted PDFs into PostScript.
You can continue without it but you will miss some Invenio
functionality. We recommend you to install it first and to rerun
the configure script. pdf2ps is available from
<http://www.cs.wisc.edu/~ghost/doc/AFPL/>.])
fi
## Check for pdftops:
AC_PATH_PROG(PDFTOPS, pdftops)
if test -z "$PDFTOPS"; then
AC_MSG_WARN([
pdftops was not found in your PATH. It is used in
the WebSubmit module to convert submitted PDFs into PostScript.
You can continue without it but you will miss some Invenio
functionality. We recommend you to install it first and to rerun
the configure script. pdftops is available from
<http://poppler.freedesktop.org/>.])
fi
## Check for pdfopt:
AC_PATH_PROG(PDFOPT, pdfopt)
if test -z "$PDFOPT"; then
AC_MSG_WARN([
pdfopt was not found in your PATH. It is used in
the WebSubmit module to linearize submitted PDFs.
You can continue without it but you will miss some Invenio
functionality. We recommend you to install it first and to rerun
the configure script. pdfopt is available from
<http://www.cs.wisc.edu/~ghost/doc/AFPL/>.])
fi
## Check for pdftoppm:
AC_PATH_PROG(PDFTOPPM, pdftoppm)
if test -z "$PDFTOPPM"; then
AC_MSG_WARN([
pdftoppm was not found in your PATH. It is used in
the WebSubmit module to extract images from PDFs for OCR.
You can continue without it but you will miss some Invenio
functionality. We recommend you to install it first and to rerun
the configure script. pdftoppm is available from
<http://poppler.freedesktop.org/>.])
fi
## Check for pamfile:
AC_PATH_PROG(PAMFILE, pamfile)
if test -z "$PAMFILE"; then
AC_MSG_WARN([
pamfile was not found in your PATH. It is used in
the WebSubmit module to retrieve the size of images extracted from PDFs
for OCR.
You can continue without it but you will miss some Invenio
functionality. We recommend you to install it first and to rerun
the configure script. pamfile is available as part of the netpbm utilities
from:
<http://netpbm.sourceforge.net/>.])
fi
## Check for ocroscript:
AC_PATH_PROG(OCROSCRIPT, ocroscript)
if test -z "$OCROSCRIPT"; then
AC_MSG_WARN([
If you plan to run OCR on your PDFs, then please install
ocroscript now. Otherwise you can safely continue. You also have the
option to install ocroscript later and edit invenio-local.conf to let
Invenio know the path to ocroscript.
ocroscript is available as part of OCRopus from
<http://code.google.com/p/ocropus/>.
NOTE: Since OCRopus is being actively developed and its API is continuously
changing, please install release 0.3.1])
fi
## Check for pstotext:
AC_PATH_PROG(PSTOTEXT, pstotext)
if test -z "$PSTOTEXT"; then
AC_MSG_WARN([
pstotext was not found in your PATH. It is used for the fulltext indexation
of PDF and PostScript files.
Please install pstotext. Otherwise you can safely continue. You also have the
option to install pstotext later and edit invenio-local.conf to let
Invenio know the path to pstotext.
pstotext is available from <http://www.cs.wisc.edu/~ghost/doc/AFPL/>.
])
fi
## Check for ps2ascii:
AC_PATH_PROG(PSTOASCII, ps2ascii)
if test -z "$PSTOASCII"; then
AC_MSG_WARN([
ps2ascii was not found in your PATH. It is used for the fulltext indexation
of PostScript files.
Please install ps2ascii. Otherwise you can safely continue. You also have the
option to install ps2ascii later and edit invenio-local.conf to let
Invenio know the path to ps2ascii.
ps2ascii is available from <http://www.cs.wisc.edu/~ghost/doc/AFPL/>.
])
fi
## Check for any2djvu:
AC_PATH_PROG(ANY2DJVU, any2djvu)
if test -z "$ANY2DJVU"; then
AC_MSG_WARN([
any2djvu was not found in your PATH. It is used in
the WebSubmit module to convert documents to DJVU.
Please install any2djvu. Otherwise you can safely continue. You also have the
option to install any2djvu later and edit invenio-local.conf to let
Invenio know the path to any2djvu.
any2djvu is available from
<http://djvu.sourceforge.net/>.])
fi
## Check for DJVUPS:
AC_PATH_PROG(DJVUPS, djvups)
if test -z "$DJVUPS"; then
AC_MSG_WARN([
djvups was not found in your PATH. It is used in
the WebSubmit module to convert documents from DJVU.
Please install djvups. Otherwise you can safely continue. You also have the
option to install djvups later and edit invenio-local.conf to let
Invenio know the path to djvups.
djvups is available from
<http://djvu.sourceforge.net/>.])
fi
## Check for DJVUTXT:
AC_PATH_PROG(DJVUTXT, djvutxt)
if test -z "$DJVUTXT"; then
AC_MSG_WARN([
djvutxt was not found in your PATH. It is used in
the WebSubmit module to extract text from DJVU documents.
You can continue without it but you will miss some Invenio
functionality. We recommend you to install it first and to rerun
the configure script. djvutxt is available from
<http://djvu.sourceforge.net/>.])
fi
## Check for file:
AC_PATH_PROG(FILE, file)
if test -z "$FILE"; then
AC_MSG_WARN([
File was not found in your PATH. It is used in
the WebSubmit module to check the validity of the submitted files.
You can continue without it but you will miss some Invenio
functionality. We recommend you to install it first and to rerun
the configure script. File is available from
<ftp://ftp.astron.com/pub/file/>.])
fi
## Check for convert:
AC_PATH_PROG(CONVERT, convert)
if test -z "$CONVERT"; then
AC_MSG_WARN([
Convert was not found in your PATH. It is used in
the WebSubmit module to create an icon from a submitted picture.
You can continue without it but you will miss some Invenio
functionality. We recommend you to install it first and to rerun
the configure script. Convert is available from
<http://www.imagemagick.org/>.])
fi
## Check for CLISP:
AC_MSG_CHECKING(for clisp)
AC_ARG_WITH(clisp, AC_HELP_STRING([--with-clisp], [path to a specific CLISP binary (optional)]), CLISP=${withval})
if test -n "$CLISP"; then
AC_MSG_RESULT($CLISP)
else
AC_PATH_PROG(CLISP, clisp)
if test -z "$CLISP"; then
AC_MSG_WARN([
GNU CLISP was not found in your PATH. It is used by the WebStat
module to produce statistics about Invenio usage. (Alternatively,
SBCL or CMUCL can be used instead of CLISP.)
You can continue without it but you will miss this feature. We
recommend you to install it first (if you have neither CMUCL
nor SBCL) and to rerun the configure script.
GNU CLISP is available from <http://clisp.cons.org/>.])
fi
fi
## Check for CMUCL:
AC_MSG_CHECKING(for cmucl)
AC_ARG_WITH(cmucl, AC_HELP_STRING([--with-cmucl], [path to a specific CMUCL binary (optional)]), CMUCL=${withval})
if test -n "$CMUCL"; then
AC_MSG_RESULT($CMUCL)
else
AC_PATH_PROG(CMUCL, cmucl)
if test -z "$CMUCL"; then
AC_MSG_CHECKING(for lisp) # CMUCL can also be installed under `lisp' exec name
AC_PATH_PROG(CMUCL, lisp)
fi
if test -z "$CMUCL"; then
AC_MSG_WARN([
CMUCL was not found in your PATH. It is used by the WebStat
module to produce statistics about Invenio usage. (Alternatively,
CLISP or SBCL can be used instead of CMUCL.)
You can continue without it but you will miss this feature. We
recommend you to install it first (if you have neither CLISP
nor SBCL) and to rerun the configure script.
CMUCL is available from <http://www.cons.org/cmucl/>.])
fi
fi
## Check for SBCL:
AC_MSG_CHECKING(for sbcl)
AC_ARG_WITH(sbcl, AC_HELP_STRING([--with-sbcl], [path to a specific SBCL binary (optional)]), SBCL=${withval})
if test -n "$SBCL"; then
AC_MSG_RESULT($SBCL)
else
AC_PATH_PROG(SBCL, sbcl)
if test -z "$SBCL"; then
AC_MSG_WARN([
SBCL was not found in your PATH. It is used by the WebStat
module to produce statistics about Invenio usage. (Alternatively,
CLISP or CMUCL can be used instead of SBCL.)
You can continue without it but you will miss this feature. We
recommend you to install it first (if you have neither CLISP
nor CMUCL) and to rerun the configure script.
SBCL is available from <http://sbcl.sourceforge.net/>.])
fi
fi
## Check for gnuplot:
AC_PATH_PROG(GNUPLOT, gnuplot)
if test -z "$GNUPLOT"; then
AC_MSG_WARN([
Gnuplot was not found in your PATH. It is used by the BibRank
module to produce graphs about download and citation history.
You can continue without it but you will miss these graphs. We
recommend you to install it first and to rerun the configure script.
Gnuplot is available from <http://www.gnuplot.info/>.])
fi
## Check for ffmpeg:
AC_PATH_PROG(FFMPEG, ffmpeg)
AC_PATH_PROG(FFPROBE, ffprobe)
if test -z "$FFMPEG"; then
AC_MSG_WARN([
FFmpeg was not found in your PATH. It is used by the BibEncode
module for video encoding.
You can continue without it, but you will not be able to use BibEncode
and no video submission workflows are thereby possible.
We recommend you to install it first if you would like to support video
submissions and to rerun the configure script.
FFmpeg is available from <http://www.ffmpeg.org/>.])
fi
## Check for mediainfo:
AC_PATH_PROG(MEDIAINFO, mediainfo)
if test -z "$MEDIAINFO"; then
AC_MSG_WARN([
Mediainfo was not found in your PATH. It is used by the BibEncode
module for video encoding and media metadata handling.
You can continue without it, but you will not be able to use BibEncode
and no video submission workflows are thereby possible.
We recommend you to install it first if you would like to support video
submissions and to rerun the configure script.
Mediainfo is available from <http://mediainfo.sourceforge.net/>.])
fi
## Substitute variables:
AC_SUBST(VERSION)
AC_SUBST(OPENOFFICE_PYTHON)
AC_SUBST(MYSQL)
AC_SUBST(PYTHON)
AC_SUBST(GZIP)
AC_SUBST(GUNZIP)
AC_SUBST(TAR)
AC_SUBST(WGET)
AC_SUBST(MD5SUM)
AC_SUBST(PS2PDF)
AC_SUBST(GS)
AC_SUBST(PDFTOTEXT)
AC_SUBST(PDFTK)
AC_SUBST(PDF2PS)
AC_SUBST(PDFTOPS)
AC_SUBST(PDFOPT)
AC_SUBST(PDFTOPPM)
AC_SUBST(OCROSCRIPT)
AC_SUBST(PSTOTEXT)
AC_SUBST(PSTOASCII)
AC_SUBST(ANY2DJVU)
AC_SUBST(DJVUPS)
AC_SUBST(DJVUTXT)
AC_SUBST(FILE)
AC_SUBST(CONVERT)
AC_SUBST(GNUPLOT)
AC_SUBST(CLISP)
AC_SUBST(CMUCL)
AC_SUBST(SBCL)
AC_SUBST(CACHEDIR)
AC_SUBST(FFMPEG)
AC_SUBST(MEDIAINFO)
AC_SUBST(FFPROBE)
AC_SUBST(localstatedir, `eval echo "${localstatedir}"`)
## Define output files:
AC_CONFIG_FILES([config.nice \
Makefile \
po/Makefile.in \
config/Makefile \
config/invenio-autotools.conf \
modules/Makefile \
modules/webauthorprofile/Makefile \
modules/webauthorprofile/lib/Makefile \
modules/webauthorprofile/bin/Makefile \
modules/webauthorprofile/bin/webauthorprofile \
modules/bibauthorid/Makefile \
modules/bibauthorid/bin/Makefile \
modules/bibauthorid/bin/bibauthorid \
modules/bibauthorid/doc/Makefile \
modules/bibauthorid/doc/admin/Makefile \
modules/bibauthorid/doc/hacking/Makefile \
modules/bibauthorid/lib/Makefile \
modules/bibauthorid/etc/Makefile \
modules/bibauthorid/etc/name_authority_files/Makefile \
modules/bibauthorid/web/Makefile \
modules/bibauthority/Makefile \
modules/bibauthority/bin/Makefile \
modules/bibauthority/doc/Makefile \
modules/bibauthority/doc/admin/Makefile \
modules/bibauthority/doc/hacking/Makefile \
modules/bibauthority/lib/Makefile \
modules/bibauthority/web/Makefile \
modules/bibcatalog/Makefile \
modules/bibcatalog/doc/Makefile \
modules/bibcatalog/doc/admin/Makefile \
modules/bibcatalog/doc/hacking/Makefile \
modules/bibcatalog/lib/Makefile \
modules/bibcheck/Makefile \
modules/bibcheck/doc/Makefile \
modules/bibcheck/doc/admin/Makefile \
modules/bibcheck/doc/hacking/Makefile \
modules/bibcheck/etc/Makefile \
modules/bibcheck/web/Makefile \
modules/bibcheck/web/admin/Makefile \
modules/bibcirculation/Makefile \
modules/bibcirculation/bin/Makefile \
modules/bibcirculation/bin/bibcircd \
modules/bibcirculation/doc/Makefile \
modules/bibcirculation/doc/admin/Makefile \
modules/bibcirculation/doc/hacking/Makefile \
modules/bibcirculation/lib/Makefile \
modules/bibcirculation/web/Makefile \
modules/bibcirculation/web/admin/Makefile \
modules/bibclassify/Makefile \
modules/bibclassify/bin/Makefile \
modules/bibclassify/bin/bibclassify \
modules/bibclassify/doc/Makefile \
modules/bibclassify/doc/admin/Makefile \
modules/bibclassify/doc/hacking/Makefile \
modules/bibclassify/etc/Makefile \
modules/bibclassify/lib/Makefile \
modules/bibconvert/Makefile \
modules/bibconvert/bin/Makefile \
modules/bibconvert/bin/bibconvert \
modules/bibconvert/doc/Makefile \
modules/bibconvert/doc/admin/Makefile \
modules/bibconvert/doc/hacking/Makefile \
modules/bibconvert/etc/Makefile \
modules/bibconvert/lib/Makefile \
modules/bibdocfile/Makefile \
modules/bibdocfile/bin/bibdocfile \
modules/bibdocfile/bin/Makefile \
modules/bibdocfile/doc/Makefile \
modules/bibdocfile/doc/hacking/Makefile \
modules/bibdocfile/lib/Makefile \
modules/bibrecord/Makefile \
modules/bibrecord/bin/Makefile \
modules/bibrecord/bin/xmlmarc2textmarc \
modules/bibrecord/bin/textmarc2xmlmarc \
modules/bibrecord/bin/xmlmarclint \
modules/bibrecord/doc/Makefile \
modules/bibrecord/doc/admin/Makefile \
modules/bibrecord/doc/hacking/Makefile \
modules/bibrecord/etc/Makefile \
modules/bibrecord/lib/Makefile \
modules/bibedit/Makefile \
modules/bibedit/bin/Makefile \
modules/bibedit/bin/bibedit \
modules/bibedit/doc/Makefile \
modules/bibedit/doc/admin/Makefile \
modules/bibedit/doc/hacking/Makefile \
modules/bibedit/etc/Makefile \
modules/bibedit/lib/Makefile \
modules/bibedit/web/Makefile \
modules/bibencode/Makefile \
modules/bibencode/bin/Makefile \
modules/bibencode/bin/bibencode \
modules/bibencode/lib/Makefile \
modules/bibencode/etc/Makefile \
modules/bibencode/www/Makefile \
modules/bibexport/Makefile \
modules/bibexport/bin/Makefile \
modules/bibexport/bin/bibexport \
modules/bibexport/doc/Makefile \
modules/bibexport/doc/admin/Makefile \
modules/bibexport/doc/hacking/Makefile \
modules/bibexport/etc/Makefile \
modules/bibexport/lib/Makefile \
modules/bibexport/web/Makefile \
modules/bibexport/web/admin/Makefile \
modules/bibfield/Makefile \
modules/bibfield/lib/Makefile \
modules/bibfield/lib/functions/Makefile \
modules/bibfield/etc/Makefile \
modules/bibformat/Makefile \
modules/bibformat/bin/Makefile \
modules/bibformat/bin/bibreformat \
modules/bibformat/doc/Makefile \
modules/bibformat/doc/admin/Makefile \
modules/bibformat/doc/hacking/Makefile \
modules/bibformat/etc/Makefile \
modules/bibformat/etc/format_templates/Makefile \
modules/bibformat/etc/output_formats/Makefile \
modules/bibformat/lib/Makefile \
modules/bibformat/lib/elements/Makefile \
modules/bibformat/web/Makefile \
modules/bibformat/web/admin/Makefile \
modules/oaiharvest/Makefile \
modules/oaiharvest/bin/Makefile \
modules/oaiharvest/bin/oaiharvest \
modules/oaiharvest/doc/Makefile \
modules/oaiharvest/doc/admin/Makefile \
modules/oaiharvest/doc/hacking/Makefile \
modules/oaiharvest/lib/Makefile \
modules/oaiharvest/web/Makefile \
modules/oaiharvest/web/admin/Makefile \
modules/oairepository/Makefile \
modules/oairepository/bin/Makefile \
modules/oairepository/bin/oairepositoryupdater \
modules/oairepository/doc/Makefile \
modules/oairepository/doc/admin/Makefile \
modules/oairepository/doc/hacking/Makefile \
modules/oairepository/etc/Makefile \
modules/oairepository/lib/Makefile \
modules/oairepository/web/Makefile \
modules/oairepository/web/admin/Makefile \
modules/bibindex/Makefile \
modules/bibindex/bin/Makefile \
modules/bibindex/bin/bibindex \
modules/bibindex/bin/bibstat \
modules/bibindex/doc/Makefile \
modules/bibindex/doc/admin/Makefile \
modules/bibindex/doc/hacking/Makefile \
modules/bibindex/lib/Makefile \
modules/bibindex/lib/tokenizers/Makefile \
modules/bibindex/web/Makefile \
modules/bibindex/web/admin/Makefile \
modules/bibknowledge/Makefile \
modules/bibknowledge/lib/Makefile \
modules/bibknowledge/doc/Makefile \
modules/bibknowledge/doc/admin/Makefile \
modules/bibknowledge/doc/hacking/Makefile \
modules/bibmatch/Makefile \
modules/bibmatch/bin/Makefile \
modules/bibmatch/bin/bibmatch \
modules/bibmatch/doc/Makefile \
modules/bibmatch/doc/admin/Makefile \
modules/bibmatch/doc/hacking/Makefile \
modules/bibmatch/etc/Makefile \
modules/bibmatch/lib/Makefile \
modules/bibmerge/Makefile \
modules/bibmerge/bin/Makefile \
modules/bibmerge/doc/Makefile \
modules/bibmerge/doc/admin/Makefile \
modules/bibmerge/doc/hacking/Makefile \
modules/bibmerge/lib/Makefile \
modules/bibmerge/web/Makefile \
modules/bibmerge/web/admin/Makefile \
modules/bibrank/Makefile \
modules/bibrank/bin/Makefile \
modules/bibrank/bin/bibrank \
modules/bibrank/bin/bibrankgkb \
modules/bibrank/doc/Makefile \
modules/bibrank/doc/admin/Makefile \
modules/bibrank/doc/hacking/Makefile \
modules/bibrank/etc/Makefile \
modules/bibrank/etc/bibrankgkb.cfg \
modules/bibrank/etc/demo_jif.cfg \
modules/bibrank/etc/template_single_tag_rank_method.cfg \
modules/bibrank/lib/Makefile \
modules/bibrank/web/Makefile \
modules/bibrank/web/admin/Makefile \
modules/bibsched/Makefile \
modules/bibsched/bin/Makefile \
modules/bibsched/bin/bibsched \
modules/bibsched/bin/bibtaskex \
modules/bibsched/bin/bibtasklet \
modules/bibsched/doc/Makefile \
modules/bibsched/doc/admin/Makefile \
modules/bibsched/doc/hacking/Makefile \
modules/bibsched/lib/Makefile \
modules/bibsched/lib/tasklets/Makefile \
modules/bibupload/Makefile \
modules/bibsort/Makefile \
modules/bibsort/bin/Makefile \
modules/bibsort/bin/bibsort \
modules/bibsort/lib/Makefile \
modules/bibsort/etc/Makefile \
modules/bibsort/doc/Makefile \
modules/bibsort/doc/admin/Makefile \
modules/bibsort/doc/hacking/Makefile \
modules/bibsort/web/Makefile \
modules/bibsort/web/admin/Makefile \
modules/bibsword/Makefile \
modules/bibsword/bin/Makefile \
modules/bibsword/bin/bibsword \
modules/bibsword/doc/Makefile \
modules/bibsword/doc/admin/Makefile \
modules/bibsword/doc/hacking/Makefile \
modules/bibsword/lib/Makefile \
modules/bibsword/etc/Makefile \
modules/bibupload/bin/Makefile \
modules/bibupload/bin/bibupload \
modules/bibupload/bin/batchuploader \
modules/bibupload/doc/Makefile \
modules/bibupload/doc/admin/Makefile \
modules/bibupload/doc/hacking/Makefile \
modules/bibupload/lib/Makefile \
modules/elmsubmit/Makefile \
modules/elmsubmit/bin/Makefile \
modules/elmsubmit/bin/elmsubmit \
modules/elmsubmit/doc/Makefile \
modules/elmsubmit/doc/admin/Makefile \
modules/elmsubmit/doc/hacking/Makefile \
modules/elmsubmit/etc/Makefile \
modules/elmsubmit/etc/elmsubmit.cfg \
modules/elmsubmit/lib/Makefile \
modules/miscutil/Makefile \
modules/miscutil/bin/Makefile \
modules/miscutil/bin/dbdump \
modules/miscutil/bin/dbexec \
modules/miscutil/bin/inveniocfg \
modules/miscutil/bin/plotextractor \
modules/miscutil/bin/hepdataharvest \
modules/miscutil/demo/Makefile \
modules/miscutil/doc/Makefile \
modules/miscutil/doc/hacking/Makefile \
modules/miscutil/etc/Makefile \
modules/miscutil/etc/bash_completion.d/Makefile \
modules/miscutil/etc/bash_completion.d/inveniocfg \
modules/miscutil/etc/ckeditor_scientificchar/Makefile \
modules/miscutil/etc/ckeditor_scientificchar/dialogs/Makefile \
modules/miscutil/etc/ckeditor_scientificchar/lang/Makefile \
modules/miscutil/lib/Makefile \
modules/miscutil/lib/upgrades/Makefile \
modules/miscutil/sql/Makefile \
modules/miscutil/web/Makefile \
modules/webaccess/Makefile \
modules/webaccess/bin/Makefile \
modules/webaccess/bin/authaction \
modules/webaccess/bin/webaccessadmin \
modules/webaccess/doc/Makefile \
modules/webaccess/doc/admin/Makefile \
modules/webaccess/doc/hacking/Makefile \
modules/webaccess/lib/Makefile \
modules/webaccess/web/Makefile \
modules/webaccess/web/admin/Makefile \
modules/webalert/Makefile \
modules/webalert/bin/Makefile \
modules/webalert/bin/alertengine \
modules/webalert/doc/Makefile \
modules/webalert/doc/admin/Makefile \
modules/webalert/doc/hacking/Makefile \
modules/webalert/lib/Makefile \
modules/webalert/web/Makefile \
modules/webbasket/Makefile \
modules/webbasket/doc/Makefile \
modules/webbasket/doc/admin/Makefile \
modules/webbasket/doc/hacking/Makefile \
modules/webbasket/lib/Makefile \
modules/webbasket/web/Makefile \
modules/webcomment/Makefile \
modules/webcomment/doc/Makefile \
modules/webcomment/doc/admin/Makefile \
modules/webcomment/doc/hacking/Makefile \
modules/webcomment/lib/Makefile \
modules/webcomment/web/Makefile \
modules/webcomment/web/admin/Makefile \
modules/webhelp/Makefile \
modules/webhelp/web/Makefile \
modules/webhelp/web/admin/Makefile \
modules/webhelp/web/admin/howto/Makefile \
modules/webhelp/web/hacking/Makefile \
modules/webjournal/Makefile \
modules/webjournal/etc/Makefile \
modules/webjournal/doc/Makefile \
modules/webjournal/doc/admin/Makefile \
modules/webjournal/doc/hacking/Makefile \
modules/webjournal/lib/Makefile \
modules/webjournal/lib/elements/Makefile \
modules/webjournal/lib/widgets/Makefile \
modules/webjournal/web/Makefile \
modules/webjournal/web/admin/Makefile \
modules/weblinkback/Makefile \
modules/weblinkback/lib/Makefile \
modules/weblinkback/web/Makefile \
modules/weblinkback/web/admin/Makefile \
modules/webmessage/Makefile \
modules/webmessage/bin/Makefile \
modules/webmessage/bin/webmessageadmin \
modules/webmessage/doc/Makefile \
modules/webmessage/doc/admin/Makefile \
modules/webmessage/doc/hacking/Makefile \
modules/webmessage/lib/Makefile \
modules/webmessage/web/Makefile \
modules/websearch/Makefile \
modules/websearch/bin/Makefile \
modules/websearch/bin/webcoll \
modules/websearch/doc/Makefile \
modules/websearch/doc/admin/Makefile \
modules/websearch/doc/hacking/Makefile \
modules/websearch/lib/Makefile \
modules/websearch/lib/services/Makefile \
modules/websearch/web/Makefile \
modules/websearch/web/admin/Makefile \
modules/websession/Makefile \
modules/websession/bin/Makefile \
modules/websession/bin/inveniogc \
modules/websession/doc/Makefile \
modules/websession/doc/admin/Makefile \
modules/websession/doc/hacking/Makefile \
modules/websession/lib/Makefile \
modules/websession/web/Makefile \
modules/webstat/Makefile \
modules/webstat/bin/Makefile \
modules/webstat/bin/webstat \
modules/webstat/bin/webstatadmin \
modules/webstat/doc/Makefile \
modules/webstat/doc/admin/Makefile \
modules/webstat/doc/hacking/Makefile \
modules/webstat/etc/Makefile \
modules/webstat/lib/Makefile \
modules/webstyle/Makefile \
modules/webstyle/bin/Makefile \
modules/webstyle/bin/gotoadmin \
modules/webstyle/bin/webdoc \
modules/webstyle/css/Makefile \
modules/webstyle/doc/Makefile \
modules/webstyle/doc/admin/Makefile \
modules/webstyle/doc/hacking/Makefile \
modules/webstyle/etc/Makefile \
modules/webstyle/img/Makefile \
modules/webstyle/lib/Makefile \
modules/webstyle/lib/goto_plugins/Makefile \
modules/websubmit/Makefile \
modules/websubmit/bin/Makefile \
modules/websubmit/bin/inveniounoconv \
modules/websubmit/bin/websubmitadmin \
modules/websubmit/doc/Makefile \
modules/websubmit/doc/admin/Makefile \
modules/websubmit/doc/hacking/Makefile \
modules/websubmit/etc/Makefile \
modules/websubmit/lib/Makefile \
modules/websubmit/lib/functions/Makefile \
modules/websubmit/web/Makefile \
modules/websubmit/web/admin/Makefile \
modules/docextract/Makefile \
modules/docextract/bin/Makefile \
modules/docextract/bin/docextract \
modules/docextract/bin/refextract \
+ modules/docextract/bin/convert_journals \
modules/docextract/doc/Makefile \
modules/docextract/lib/Makefile \
modules/docextract/etc/Makefile \
modules/docextract/doc/admin/Makefile \
modules/docextract/doc/hacking/Makefile \
])
## Finally, write output files:
AC_OUTPUT
## Write help:
AC_MSG_RESULT([****************************************************************************])
AC_MSG_RESULT([** Your Invenio installation is now ready for building. **])
AC_MSG_RESULT([** You have entered the following parameters: **])
AC_MSG_RESULT([** - Invenio main install directory: ${prefix}])
AC_MSG_RESULT([** - Python executable: $PYTHON])
AC_MSG_RESULT([** - MySQL client executable: $MYSQL])
AC_MSG_RESULT([** - CLISP executable: $CLISP])
AC_MSG_RESULT([** - CMUCL executable: $CMUCL])
AC_MSG_RESULT([** - SBCL executable: $SBCL])
AC_MSG_RESULT([** Here are the steps to continue the building process: **])
AC_MSG_RESULT([** 1) Type 'make' to build your Invenio system. **])
AC_MSG_RESULT([** 2) Type 'make install' to install your Invenio system. **])
AC_MSG_RESULT([** After that you can start customizing your installation as documented **])
AC_MSG_RESULT([** in the INSTALL file (i.e. edit invenio.conf, run inveniocfg, etc). **])
AC_MSG_RESULT([** Good luck, and thanks for choosing Invenio. **])
AC_MSG_RESULT([** -- Invenio Development Team <info@invenio-software.org> **])
AC_MSG_RESULT([****************************************************************************])
## end of file
diff --git a/modules/bibedit/lib/bibedit_utils.py b/modules/bibedit/lib/bibedit_utils.py
index ab7e1b5ce..b4184391e 100644
--- a/modules/bibedit/lib/bibedit_utils.py
+++ b/modules/bibedit/lib/bibedit_utils.py
@@ -1,1037 +1,1037 @@
## This file is part of Invenio.
## Copyright (C) 2008, 2009, 2010, 2011, 2013 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
# pylint: disable=C0103
"""BibEdit Utilities.
This module contains support functions (i.e., those that are not called directly
by the web interface) that might be imported by other modules or that are called
by both the web and CLI interfaces.
"""
__revision__ = "$Id$"
import cPickle
import difflib
import fnmatch
import marshal
import os
import re
import time
import zlib
import tempfile
import sys
from datetime import datetime
try:
from cStringIO import StringIO
except ImportError:
from StringIO import StringIO
from invenio.bibedit_config import CFG_BIBEDIT_FILENAME, \
CFG_BIBEDIT_RECORD_TEMPLATES_PATH, CFG_BIBEDIT_TO_MERGE_SUFFIX, \
CFG_BIBEDIT_FIELD_TEMPLATES_PATH, CFG_BIBEDIT_AJAX_RESULT_CODES_REV, \
CFG_BIBEDIT_CACHEDIR
from invenio.bibedit_dblayer import get_record_last_modification_date, \
delete_hp_change
from invenio.bibrecord import create_record, create_records, \
record_get_field_value, record_has_field, record_xml_output, \
record_strip_empty_fields, record_strip_empty_volatile_subfields, \
record_order_subfields, record_get_field_instances, \
record_add_field, field_get_subfield_codes, field_add_subfield, \
field_get_subfield_values, record_delete_fields, record_add_fields, \
record_get_field_values, print_rec, record_modify_subfield, \
record_modify_controlfield
from invenio.bibtask import task_low_level_submission
from invenio.config import CFG_BIBEDIT_LOCKLEVEL, \
CFG_BIBEDIT_TIMEOUT, CFG_BIBUPLOAD_EXTERNAL_OAIID_TAG as OAIID_TAG, \
CFG_BIBUPLOAD_EXTERNAL_SYSNO_TAG as SYSNO_TAG, \
CFG_BIBEDIT_QUEUE_CHECK_METHOD, \
CFG_BIBEDIT_EXTEND_RECORD_WITH_COLLECTION_TEMPLATE, CFG_INSPIRE_SITE
from invenio.dateutils import convert_datetext_to_dategui
from invenio.textutils import wash_for_xml
from invenio.bibedit_dblayer import get_bibupload_task_opts, \
get_marcxml_of_record_revision, get_record_revisions, \
get_info_of_record_revision
from invenio.search_engine import print_record, record_exists, get_colID, \
guess_primary_collection_of_a_record, get_record, \
get_all_collections_of_a_record
from invenio.search_engine_utils import get_fieldvalues
from invenio.webuser import get_user_info, getUid, get_email
from invenio.dbquery import run_sql
from invenio.websearchadminlib import get_detailed_page_tabs
from invenio.access_control_engine import acc_authorize_action
from invenio.refextract_api import extract_references_from_record_xml, \
extract_references_from_string_xml, \
extract_references_from_url_xml
from invenio.textmarc2xmlmarc import transform_file, ParseError
from invenio.bibauthorid_name_utils import split_name_parts, \
create_normalized_name
from invenio.bibknowledge import get_kbr_values
# Precompile regexp:
re_file_option = re.compile(r'^%s' % CFG_BIBEDIT_CACHEDIR)
re_xmlfilename_suffix = re.compile('_(\d+)_\d+\.xml$')
re_revid_split = re.compile('^(\d+)\.(\d{14})$')
re_revdate_split = re.compile('^(\d\d\d\d)(\d\d)(\d\d)(\d\d)(\d\d)(\d\d)')
re_taskid = re.compile('ID="(\d+)"')
re_tmpl_name = re.compile('<!-- BibEdit-Template-Name: (.*) -->')
re_tmpl_description = re.compile('<!-- BibEdit-Template-Description: (.*) -->')
re_ftmpl_name = re.compile('<!-- BibEdit-Field-Template-Name: (.*) -->')
re_ftmpl_description = re.compile('<!-- BibEdit-Field-Template-Description: (.*) -->')
VOLATILE_PREFIX = "VOLATILE:"
# Authorization
def user_can_edit_record_collection(req, recid):
""" Check if user has authorization to modify a collection
the recid belongs to
"""
def remove_volatile(field_value):
""" Remove volatile keyword from field value """
if field_value.startswith(VOLATILE_PREFIX):
field_value = field_value[len(VOLATILE_PREFIX):]
return field_value
# Get the collections the record belongs to
record_collections = get_all_collections_of_a_record(recid)
uid = getUid(req)
# In case we are creating a new record
if cache_exists(recid, uid):
dummy1, dummy2, record, dummy3, dummy4, dummy5, dummy6 = get_cache_file_contents(recid, uid)
values = record_get_field_values(record, '980', code="a")
record_collections.extend([remove_volatile(v) for v in values])
normalized_collections = []
for collection in record_collections:
# Get the normalized collection name present in the action table
res = run_sql("""SELECT value FROM accARGUMENT
WHERE keyword='collection'
AND value=%s;""", (collection,))
if res:
normalized_collections.append(res[0][0])
if not normalized_collections:
# Check if user has access to all collections
auth_code, auth_message = acc_authorize_action(req, 'runbibedit',
collection='')
if auth_code == 0:
return True
else:
for collection in normalized_collections:
auth_code, auth_message = acc_authorize_action(req, 'runbibedit',
collection=collection)
if auth_code == 0:
return True
return False
# Helper functions
def assert_undo_redo_lists_correctness(undo_list, redo_list):
for undoItem in undo_list:
assert undoItem is not None
for redoItem in redo_list:
assert redoItem is not None
def record_find_matching_fields(key, rec, tag="", ind1=" ", ind2=" ", \
exact_match=False):
"""
This utility function looks for any field values containing the given keyword
string (or, if an exact match is wanted, equal to it). The found fields are
returned as a list of field instances per tag. The fields to search can be
narrowed down to tag/indicator level.
@param key: keyword to search for
@type key: string
@param rec: a record structure as returned by bibrecord.create_record()
@type rec: dict
@param tag: a 3 characters long string
@type tag: string
@param ind1: a 1 character long string
@type ind1: string
@param ind2: a 1 character long string
@type ind2: string
@return: a list of found fields as a tuple per tag: (tag, field_instances), where
field_instances is a list of (subfields, ind1, ind2, value, field_position_global)
and subfields is a list of (code, value) tuples
@rtype: list
"""
if not tag:
all_field_instances = rec.items()
else:
all_field_instances = [(tag, record_get_field_instances(rec, tag, ind1, ind2))]
matching_field_instances = []
for current_tag, field_instances in all_field_instances:
found_fields = []
for field_instance in field_instances:
# Get values to match: controlfield_value + subfield values
values_to_match = [field_instance[3]] + \
[val for code, val in field_instance[0]]
if exact_match and key in values_to_match:
found_fields.append(field_instance)
else:
for value in values_to_match:
if value.find(key) > -1:
found_fields.append(field_instance)
break
if len(found_fields) > 0:
matching_field_instances.append((current_tag, found_fields))
return matching_field_instances
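# Illustrative sketch (not part of the original module): how the matcher
# above might be called. The record value is hypothetical and stands for a
# structure returned by bibrecord.create_record():
#
# rec = create_record(open('record.xml').read())[0]
# for tag, fields in record_find_matching_fields('Ellis', rec, tag='700'):
# print tag, len(fields)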
# Operations on the BibEdit cache file
def cache_exists(recid, uid):
"""Check if the BibEdit cache file exists."""
return os.path.isfile('%s.tmp' % _get_file_path(recid, uid))
def get_cache_mtime(recid, uid):
"""Get the last modified time of the BibEdit cache file. Check that the
cache exists before calling this function.
"""
try:
return int(os.path.getmtime('%s.tmp' % _get_file_path(recid, uid)))
except OSError:
pass
def cache_expired(recid, uid):
"""Has it been longer than the number of seconds given by
CFG_BIBEDIT_TIMEOUT since last cache update? Check that the
cache exists before calling this function.
"""
return get_cache_mtime(recid, uid) < int(time.time()) - CFG_BIBEDIT_TIMEOUT
def create_cache_file(recid, uid, record='', cache_dirty=False, pending_changes=[], disabled_hp_changes = {}, undo_list = [], redo_list=[]):
"""Create a BibEdit cache file, and return revision and record. This will
overwrite any existing cache the user has for this record.
datetime.
"""
if not record:
record = get_bibrecord(recid)
if not record:
return
file_path = '%s.tmp' % _get_file_path(recid, uid)
record_revision = get_record_last_modification_date(recid)
if record_revision is None:
record_revision = datetime.now().timetuple()
cache_file = open(file_path, 'w')
assert_undo_redo_lists_correctness(undo_list, redo_list)
# Order subfields alphabetically after loading the record
record_order_subfields(record)
cPickle.dump([cache_dirty, record_revision, record, pending_changes, disabled_hp_changes, undo_list, redo_list], cache_file)
cache_file.close()
return record_revision, record
def touch_cache_file(recid, uid):
"""Touch a BibEdit cache file. This should be used to indicate that the
user has again accessed the record, so that locking will work correctly.
"""
if cache_exists(recid, uid):
os.system('touch %s.tmp' % _get_file_path(recid, uid))
def get_bibrecord(recid):
"""Return record in BibRecord wrapping."""
if record_exists(recid):
return create_record(print_record(recid, 'xm'))[0]
def get_cache_file_contents(recid, uid):
"""Return the contents of a BibEdit cache file."""
cache_file = _get_cache_file(recid, uid, 'r')
if cache_file:
cache_dirty, record_revision, record, pending_changes, disabled_hp_changes, undo_list, redo_list = cPickle.load(cache_file)
cache_file.close()
assert_undo_redo_lists_correctness(undo_list, redo_list)
return cache_dirty, record_revision, record, pending_changes, disabled_hp_changes, undo_list, redo_list
def update_cache_file_contents(recid, uid, record_revision, record, pending_changes, disabled_hp_changes, undo_list, redo_list):
"""Save updates to the record in BibEdit cache. Return file modificaton
time.
"""
cache_file = _get_cache_file(recid, uid, 'w')
if cache_file:
assert_undo_redo_lists_correctness(undo_list, redo_list)
cPickle.dump([True, record_revision, record, pending_changes, disabled_hp_changes, undo_list, redo_list], cache_file)
cache_file.close()
return get_cache_mtime(recid, uid)
def delete_cache_file(recid, uid):
"""Delete a BibEdit cache file."""
try:
os.remove('%s.tmp' % _get_file_path(recid, uid))
except OSError:
# File was probably already removed
pass
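# Typical cache lifecycle using the helpers above (illustrative sketch only;
# recid/uid values are hypothetical): a cache is created, checked for
# staleness, updated, and finally deleted:
#
# revision, record = create_cache_file(recid, uid)
# if cache_expired(recid, uid):
# touch_cache_file(recid, uid) # refresh the lock
# update_cache_file_contents(recid, uid, revision, record, [], {}, [], [])
# delete_cache_file(recid, uid)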
def delete_disabled_changes(used_changes):
for change_id in used_changes:
delete_hp_change(change_id)
def save_xml_record(recid, uid, xml_record='', to_upload=True, to_merge=False):
"""Write XML record to file. Default behaviour is to read the record from
a BibEdit cache file, filter out the unchanged volatile subfields,
write it back to an XML file and then pass this file to BibUpload.
@param xml_record: give XML as a string instead of reading the cache file
@param to_upload: pass the XML file to BibUpload
@param to_merge: prepare an XML file for BibMerge to use
"""
if not xml_record:
# Read record from cache file.
cache = get_cache_file_contents(recid, uid)
if cache:
record = cache[2]
used_changes = cache[4]
xml_record = record_xml_output(record)
delete_cache_file(recid, uid)
delete_disabled_changes(used_changes)
else:
record = create_record(xml_record)[0]
# clean the record from unfilled volatile fields
record_strip_empty_volatile_subfields(record)
record_strip_empty_fields(record)
# order subfields alphabetically before saving the record
record_order_subfields(record)
xml_to_write = wash_for_xml(record_xml_output(record))
# Write XML file.
if not to_merge:
file_path = '%s.xml' % _get_file_path(recid, uid)
else:
file_path = '%s_%s.xml' % (_get_file_path(recid, uid),
CFG_BIBEDIT_TO_MERGE_SUFFIX)
xml_file = open(file_path, 'w')
xml_file.write(xml_to_write)
xml_file.close()
user_name = get_user_info(uid)[1]
if to_upload:
# Pass XML file to BibUpload.
task_low_level_submission('bibupload', 'bibedit', '-P', '5', '-r',
file_path, '-u', user_name)
return True
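# Sketch of the save-and-upload flow handled by save_xml_record() above
# (illustrative only): the cached record is written to
# <CFG_BIBEDIT_CACHEDIR>/<CFG_BIBEDIT_FILENAME>_<recid>_<uid>.xml and a
# 'bibupload -r' task is queued via task_low_level_submission().
#
# save_xml_record(recid, uid) # upload the cached record
# save_xml_record(recid, uid, to_upload=False, to_merge=True) # for BibMerge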
# Security: Locking and integrity
def latest_record_revision(recid, revision_time):
"""Check if timetuple REVISION_TIME matches latest modification date."""
latest = get_record_last_modification_date(recid)
# this can be none if the record is new
return (latest is None) or (revision_time == latest)
def record_locked_by_other_user(recid, uid):
"""Return true if any other user than UID has active caches for record
RECID.
"""
active_uids = _uids_with_active_caches(recid)
try:
active_uids.remove(uid)
except ValueError:
pass
return bool(active_uids)
def get_record_locked_since(recid, uid):
""" Get modification time for the given recid and uid
"""
filename = "%s_%s_%s.tmp" % (CFG_BIBEDIT_FILENAME,
recid,
uid)
locked_since = ""
try:
locked_since = time.ctime(os.path.getmtime('%s%s%s' % (
CFG_BIBEDIT_CACHEDIR, os.sep, filename)))
except OSError:
pass
return locked_since
def record_locked_by_user_details(recid, uid):
""" Get the details about the user that has locked a record and the
time the record has been locked.
@return: user details and time when record was locked
@rtype: tuple
"""
active_uids = _uids_with_active_caches(recid)
try:
active_uids.remove(uid)
except ValueError:
pass
record_blocked_by_nickname = record_blocked_by_email = locked_since = ""
if active_uids:
record_blocked_by_uid = active_uids[0]
record_blocked_by_nickname = get_user_info(record_blocked_by_uid)[1]
record_blocked_by_email = get_email(record_blocked_by_uid)
locked_since = get_record_locked_since(recid, record_blocked_by_uid)
return record_blocked_by_nickname, record_blocked_by_email, locked_since
def record_locked_by_queue(recid):
"""Check if record should be locked for editing because of the current state
of the BibUpload queue. The level of checking is based on
CFG_BIBEDIT_LOCKLEVEL.
"""
# Check for *any* scheduled bibupload tasks.
if CFG_BIBEDIT_LOCKLEVEL == 2:
return _get_bibupload_task_ids()
filenames = _get_bibupload_filenames()
# Check for match between name of XML-files and record.
# Assumes that filename ends with _<recid>.xml.
if CFG_BIBEDIT_LOCKLEVEL == 1:
recids = []
for filename in filenames:
filename_suffix = re_xmlfilename_suffix.search(filename)
if filename_suffix:
recids.append(int(filename_suffix.group(1)))
return recid in recids
# Check for match between content of files and record.
if CFG_BIBEDIT_LOCKLEVEL == 3:
while True:
lock = _record_in_files_p(recid, filenames)
# Check if any new files were added while we were searching
if not lock:
filenames_updated = _get_bibupload_filenames()
for filename in filenames_updated:
if not filename in filenames:
break
else:
return lock
else:
return lock
# History/revisions
def revision_to_timestamp(td):
"""
Convert a revision date (time tuple) to a 14-digit timestamp string
"""
return "%04i%02i%02i%02i%02i%02i" % (td.tm_year, td.tm_mon, td.tm_mday, \
td.tm_hour, td.tm_min, td.tm_sec)
def timestamp_to_revision(timestamp):
"""
Convert a 14-digit timestamp string back to a revision date (time tuple)
"""
year = int(timestamp[0:4])
month = int(timestamp[4:6])
day = int(timestamp[6:8])
hour = int(timestamp[8:10])
minute = int(timestamp[10:12])
second = int(timestamp[12:14])
return datetime(year, month, day, hour, minute, second).timetuple()
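# Example round trip between the two helpers above (illustrative only):
#
# ts = revision_to_timestamp(datetime(2013, 5, 1, 12, 0, 0).timetuple())
# # ts == '20130501120000'
# td = timestamp_to_revision(ts) # back to a time tuple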
def get_record_revision_timestamps(recid):
"""return list of timestamps describing teh revisions of a given record"""
rev_ids = get_record_revision_ids(recid)
result = []
for rev_id in rev_ids:
result.append(rev_id.split(".")[1])
return result
def get_record_revision_ids(recid):
"""Return list of all record revision IDs.
Return revision IDs in chronologically decreasing order (latest first).
"""
res = []
tmp_res = get_record_revisions(recid)
for row in tmp_res:
res.append('%s.%s' % (row[0], row[1]))
return res
def get_marcxml_of_revision(recid, revid):
"""Return MARCXML string of revision.
Return empty string if revision does not exist. REVID should be a string.
"""
res = ''
tmp_res = get_marcxml_of_record_revision(recid, revid)
if tmp_res:
for row in tmp_res:
res += zlib.decompress(row[0]) + '\n'
return res
def get_marcxml_of_revision_id(revid):
"""Return MARCXML string of revision.
Return empty string if revision does not exist. REVID should be a string.
"""
recid, job_date = split_revid(revid, 'datetext')
return get_marcxml_of_revision(recid, job_date)
def get_info_of_revision_id(revid):
"""Return info string regarding revision.
Return empty string if revision does not exist. REVID should be a string.
"""
recid, job_date = split_revid(revid, 'datetext')
res = ''
tmp_res = get_info_of_record_revision(recid, job_date)
if tmp_res:
task_id = str(tmp_res[0][0])
author = tmp_res[0][1]
if not author:
author = 'N/A'
res += '%s %s %s' % (revid.ljust(22), task_id.ljust(15), author.ljust(15))
job_details = tmp_res[0][2].split()
upload_mode = job_details[0] + job_details[1][:-1]
upload_file = job_details[2] + job_details[3][:-1]
res += '%s %s' % (upload_mode, upload_file)
return res
def revision_format_valid_p(revid):
"""Test validity of revision ID format (=RECID.REVDATE)."""
if re_revid_split.match(revid):
return True
return False
def record_revision_exists(recid, revid):
results = get_record_revisions(recid)
for res in results:
if res[1] == revid:
return True
return False
def split_revid(revid, dateformat=''):
"""Split revid and return tuple (recid, revdate).
Optional dateformat can be datetext or dategui.
"""
recid, revdate = re_revid_split.search(revid).groups()
if dateformat:
datetext = '%s-%s-%s %s:%s:%s' % re_revdate_split.search(
revdate).groups()
if dateformat == 'datetext':
revdate = datetext
elif dateformat == 'dategui':
revdate = convert_datetext_to_dategui(datetext, secs=True)
return recid, revdate
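# Example of the revision ID format handled above (illustrative values):
# a revid is '<recid>.<YYYYMMDDHHMMSS>', e.g.:
#
# split_revid('1234.20130501120000')
# # -> ('1234', '20130501120000')
# split_revid('1234.20130501120000', 'datetext')
# # -> ('1234', '2013-05-01 12:00:00')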
def modify_record_timestamp(revision_xml, last_revision_ts):
""" Modify tag 005 to add the revision passed as parameter.
@param revision_xml: marcxml representation of the record to modify
@type revision_xml: string
@param last_revision_ts: timestamp to add to 005 tag
@type last_revision_ts: string
@return: marcxml with 005 tag modified
"""
recstruct = create_record(revision_xml)[0]
record_modify_controlfield(recstruct, "005", last_revision_ts,
field_position_local=0)
return record_xml_output(recstruct)
def get_xml_comparison(header1, header2, xml1, xml2):
"""Return diff of two MARCXML records."""
return ''.join(difflib.unified_diff(xml1.splitlines(1),
xml2.splitlines(1), header1, header2))
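# Illustrative use of the diff helper above with two revision IDs
# (hypothetical values):
#
# xml1 = get_marcxml_of_revision_id('1234.20130501120000')
# xml2 = get_marcxml_of_revision_id('1234.20130601120000')
# print get_xml_comparison('rev1', 'rev2', xml1, xml2)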
#Templates
def get_templates(templatesDir, tmpl_name, tmpl_description, extractContent = False):
"""Return list of templates [filename, name, description, content*]
the extractContent variable indicated if the parsed content should
be included"""
template_fnames = fnmatch.filter(os.listdir(
templatesDir), '*.xml')
templates = []
for fname in template_fnames:
filepath = '%s%s%s' % (templatesDir, os.sep, fname)
template_file = open(filepath,'r')
template = template_file.read()
template_file.close()
fname_stripped = os.path.splitext(fname)[0]
mo_name = tmpl_name.search(template)
mo_description = tmpl_description.search(template)
date_modified = time.ctime(os.path.getmtime(filepath))
if mo_name:
name = mo_name.group(1)
else:
name = fname_stripped
if mo_description:
description = mo_description.group(1)
else:
description = ''
if extractContent:
parsedTemplate = create_record(template)[0]
if parsedTemplate is not None:
# If the template was correct
templates.append([fname_stripped, name, description, parsedTemplate])
else:
raise ValueError("Problem when parsing the template %s" % fname)
else:
templates.append([fname_stripped, name, description, date_modified])
return templates
# Field templates
def get_field_templates():
"""Returns list of field templates [filename, name, description, content]"""
return get_templates(CFG_BIBEDIT_FIELD_TEMPLATES_PATH, re_ftmpl_name, re_ftmpl_description, True)
# Record templates
def get_record_templates():
"""Return list of record template [filename, name, description] ."""
return get_templates(CFG_BIBEDIT_RECORD_TEMPLATES_PATH, re_tmpl_name, re_tmpl_description, False)
def get_record_template(name):
"""Return an XML record template."""
filepath = '%s%s%s.xml' % (CFG_BIBEDIT_RECORD_TEMPLATES_PATH, os.sep, name)
if os.path.isfile(filepath):
template_file = open(filepath, 'r')
template = template_file.read()
template_file.close()
return template
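# Sketch of template usage (illustrative only; 'Article' is a hypothetical
# template name): load an XML record template and turn it into a record
# structure for further editing:
#
# template_xml = get_record_template('Article')
# if template_xml:
# template_rec = create_record(template_xml)[0]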
# Private functions
def _get_cache_file(recid, uid, mode):
"""Return a BibEdit cache file object."""
if cache_exists(recid, uid):
return open('%s.tmp' % _get_file_path(recid, uid), mode)
def _get_file_path(recid, uid, filename=''):
"""Return the file path to a BibEdit file (excluding suffix).
If filename is specified this replaces the config default.
"""
if not filename:
return '%s%s%s_%s_%s' % (CFG_BIBEDIT_CACHEDIR, os.sep, CFG_BIBEDIT_FILENAME,
recid, uid)
else:
return '%s%s%s_%s_%s' % (CFG_BIBEDIT_CACHEDIR, os.sep, filename, recid, uid)
def _uids_with_active_caches(recid):
"""Return list of uids with active caches for record RECID. Active caches
are caches that have been modified a number of seconds ago that is less than
the one given by CFG_BIBEDIT_TIMEOUT.
"""
re_tmpfilename = re.compile('%s_%s_(\d+)\.tmp' % (CFG_BIBEDIT_FILENAME,
recid))
tmpfiles = fnmatch.filter(os.listdir(CFG_BIBEDIT_CACHEDIR), '%s*.tmp' %
CFG_BIBEDIT_FILENAME)
expire_time = int(time.time()) - CFG_BIBEDIT_TIMEOUT
active_uids = []
for tmpfile in tmpfiles:
mo = re_tmpfilename.match(tmpfile)
if mo and int(os.path.getmtime('%s%s%s' % (
CFG_BIBEDIT_CACHEDIR, os.sep, tmpfile))) > expire_time:
active_uids.append(int(mo.group(1)))
return active_uids
def _get_bibupload_task_ids():
"""Return list of all BibUpload task IDs.
Ignore tasks submitted by user bibreformat.
"""
res = run_sql('''SELECT id FROM schTASK WHERE proc LIKE "bibupload%" AND user <> "bibreformat" AND status IN ("WAITING", "SCHEDULED", "RUNNING", "CONTINUING", "ABOUT TO STOP", "ABOUT TO SLEEP", "SLEEPING")''')
return [row[0] for row in res]
def _get_bibupload_filenames():
"""Return paths to all files scheduled for upload."""
task_ids = _get_bibupload_task_ids()
filenames = []
tasks_opts = get_bibupload_task_opts(task_ids)
for task_opts in tasks_opts:
if task_opts:
record_options = marshal.loads(task_opts[0][0])
for option in record_options[1:]:
if re_file_option.search(option):
filenames.append(option)
return filenames
def _record_in_files_p(recid, filenames):
"""Search XML files for given record."""
# Get id tags of record in question
rec_oaiid = rec_sysno = -1
rec_oaiid_tag = get_fieldvalues(recid, OAIID_TAG)
if rec_oaiid_tag:
rec_oaiid = rec_oaiid_tag[0]
rec_sysno_tag = get_fieldvalues(recid, SYSNO_TAG)
if rec_sysno_tag:
rec_sysno = rec_sysno_tag[0]
# For each record in each file, compare ids and abort if match is found
for filename in filenames:
try:
if CFG_BIBEDIT_QUEUE_CHECK_METHOD == 'regexp':
# check via regexp: this is fast, but may not be precise
re_match_001 = re.compile('<controlfield tag="001">%s</controlfield>' % (recid))
re_match_oaiid = re.compile('<datafield tag="%s" ind1=" " ind2=" ">(\s*<subfield code="a">\s*|\s*<subfield code="9">\s*.*\s*</subfield>\s*<subfield code="a">\s*)%s' % (OAIID_TAG[0:3],rec_oaiid))
re_match_sysno = re.compile('<datafield tag="%s" ind1=" " ind2=" ">(\s*<subfield code="a">\s*|\s*<subfield code="9">\s*.*\s*</subfield>\s*<subfield code="a">\s*)%s' % (SYSNO_TAG[0:3],rec_sysno))
file_content = open(filename).read()
if re_match_001.search(file_content):
return True
if rec_oaiid_tag:
if re_match_oaiid.search(file_content):
return True
if rec_sysno_tag:
if re_match_sysno.search(file_content):
return True
else:
# by default, check via bibrecord: this is accurate, but may be slow
file_ = open(filename)
records = create_records(file_.read(), 0, 0)
for record_result in records:
record, all_good = record_result[:2]
if record and all_good:
if _record_has_id_p(record, recid, rec_oaiid, rec_sysno):
return True
file_.close()
except IOError:
continue
return False
def _record_has_id_p(record, recid, rec_oaiid, rec_sysno):
"""Check if record matches any of the given IDs."""
if record_has_field(record, '001'):
if (record_get_field_value(record, '001', '%', '%')
== str(recid)):
return True
if record_has_field(record, OAIID_TAG[0:3]):
if (record_get_field_value(
record, OAIID_TAG[0:3], OAIID_TAG[3],
OAIID_TAG[4], OAIID_TAG[5]) == rec_oaiid):
return True
if record_has_field(record, SYSNO_TAG[0:3]):
if (record_get_field_value(
record, SYSNO_TAG[0:3], SYSNO_TAG[3],
SYSNO_TAG[4], SYSNO_TAG[5]) == rec_sysno):
return True
return False
def can_record_have_physical_copies(recid):
"""Determine if the record can have physical copies
(addable through the bibCirculation module).
The information is derived using the tabs displayed for a given record.
Only records already saved within the collection may have physical copies.
@return: True or False
"""
if get_record(recid) == None:
return False
col_id = get_colID(guess_primary_collection_of_a_record(recid))
collections = get_detailed_page_tabs(col_id, recid)
if "holdings" not in collections or \
"visible" not in collections["holdings"]:
return False
return collections["holdings"]["visible"] == True
def get_record_collections(recid):
""" Returns all collections of a record, field 980
@param recid: record id to get collections from
@type recid: string
@return: list of collections
@rtype: list
"""
recstruct = get_record(recid)
return [collection for collection in record_get_field_values(recstruct,
tag="980",
ind1=" ",
ind2=" ",
code="a")]
def extend_record_with_template(recid):
""" Determine if the record has to be extended with the content
of a template as defined in CFG_BIBEDIT_EXTEND_RECORD_WITH_COLLECTION_TEMPLATE
@return: template name to be applied to record or False if no template
has to be applied
"""
rec_collections = get_record_collections(recid)
for collection in rec_collections:
if collection in CFG_BIBEDIT_EXTEND_RECORD_WITH_COLLECTION_TEMPLATE:
return CFG_BIBEDIT_EXTEND_RECORD_WITH_COLLECTION_TEMPLATE[collection]
return False
def merge_record_with_template(rec, template_name):
""" Extend the record rec with the contents of the template and return it"""
template = get_record_template(template_name)
if not template:
return
template_bibrec = create_record(template)[0]
for field_tag in template_bibrec:
if not record_has_field(rec, field_tag):
for field_instance in template_bibrec[field_tag]:
record_add_field(rec, field_tag, field_instance[1],
field_instance[2], subfields=field_instance[0])
else:
for template_field_instance in template_bibrec[field_tag]:
subfield_codes_template = field_get_subfield_codes(template_field_instance)
for field_instance in rec[field_tag]:
subfield_codes = field_get_subfield_codes(field_instance)
for code in subfield_codes_template:
if code not in subfield_codes:
field_add_subfield(field_instance, code,
field_get_subfield_values(template_field_instance,
code)[0])
return rec
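# Sketch of how the two helpers above can be combined when opening a record
# (illustrative only):
#
# template_name = extend_record_with_template(recid)
# if template_name:
# rec = merge_record_with_template(rec, template_name)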
#################### Reference extraction ####################
def replace_references(recid, uid=None, txt=None, url=None):
"""Replace references for a record
The record itself is not updated, the marc xml of the document with updated
references is returned
Parameters:
* recid: the id of the record
* txt: references in text mode
* inspire: format of ther references
"""
# Parse references
if txt is not None:
references_xml = extract_references_from_string_xml(txt, is_only_references=True)
elif url is not None:
references_xml = extract_references_from_url_xml(url)
else:
references_xml = extract_references_from_record_xml(recid)
- references = create_record(references_xml.encode('utf-8'))
+ references = create_record(references_xml)
dummy1, dummy2, record, dummy3, dummy4, dummy5, dummy6 = get_cache_file_contents(recid, uid)
out_xml = None
references_to_add = record_get_field_instances(references[0],
tag='999',
ind1='C',
ind2='5')
refextract_status = record_get_field_instances(references[0],
tag='999',
ind1='C',
ind2='6')
if references_to_add:
# Replace 999 fields
record_delete_fields(record, '999')
record_add_fields(record, '999', references_to_add)
record_add_fields(record, '999', refextract_status)
# Update record references
out_xml = record_xml_output(record)
return out_xml
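# Illustrative calls of replace_references() (values are hypothetical);
# exactly one source is used: txt, url, or the record itself:
#
# new_xml = replace_references(recid, uid) # re-extract from the record
# new_xml = replace_references(recid, uid, txt=raw_refs) # from pasted text
# new_xml = replace_references(recid, uid, url='http://example.org/paper.pdf')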
#################### cnum generation ####################
def record_is_conference(record):
"""
Determine if the record is a new conference based on the value present
in field 980
@param record: record to be checked
@type record: bibrecord object
@return: True if record is a conference, False otherwise
@rtype: boolean
"""
# Get collection field content (tag 980)
tag_980_content = record_get_field_values(record, "980", " ", " ", "a")
if "CONFERENCES" in tag_980_content:
return True
return False
def add_record_cnum(recid, uid):
"""
Check if the record already has a cnum. If not, generate a new one
and return it
@param recid: recid of the record under check. Used to retrieve cache file
@type recid: int
@param uid: id of the user. Used to retrieve cache file
@type uid: int
@return: None if cnum already present, new cnum otherwise
@rtype: None or string
"""
# Import placed here to avoid circular dependency
from invenio.sequtils_cnum import CnumSeq, ConferenceNoStartDateError
record_revision, record, pending_changes, deactivated_hp_changes, \
undo_list, redo_list = get_cache_file_contents(recid, uid)[1:]
record_strip_empty_volatile_subfields(record)
# Check if record already has a cnum
tag_111__g_content = record_get_field_value(record, "111", " ", " ", "g")
if tag_111__g_content:
return
else:
cnum_seq = CnumSeq()
try:
new_cnum = cnum_seq.next_value(xml_record=wash_for_xml(print_rec(record)))
except ConferenceNoStartDateError:
return None
field_add_subfield(record['111'][0], 'g', new_cnum)
update_cache_file_contents(recid, uid, record_revision,
record, \
pending_changes, \
deactivated_hp_changes, \
undo_list, redo_list)
return new_cnum
def get_xml_from_textmarc(recid, textmarc_record):
"""
Convert textmarc to marcxml and return the result of the conversion
@param recid: id of the record that is being converted
@type: int
@param textmarc_record: record content in textmarc format
@type: string
@return: dictionary with the following keys:
* resultMsg: message describing conversion status
* resultXML: xml resulting from conversion
* parse_error: in case of error, a description of it
@rtype: dict
"""
response = {}
# Let's remove empty lines
textmarc_record = os.linesep.join([s for s in textmarc_record.splitlines() if s])
# Create temp file with textmarc to be converted by textmarc2xmlmarc
(file_descriptor, file_name) = tempfile.mkstemp()
f = os.fdopen(file_descriptor, "w")
# Write content, prepending the record id (sysno) to each line
for line in textmarc_record.splitlines():
f.write("%09d %s\n" % (recid, re.sub("\s+", " ", line.strip())))
f.close()
old_stdout = sys.stdout
try:
# Redirect output, transform, restore old references
new_stdout = StringIO()
sys.stdout = new_stdout
try:
transform_file(file_name)
response['resultMsg'] = 'textmarc_parsing_success'
response['resultXML'] = new_stdout.getvalue()
except ParseError, e:
# Something went wrong, notify user
response['resultXML'] = ""
response['resultMsg'] = 'textmarc_parsing_error'
response['parse_error'] = [e.lineno, " ".join(e.linecontent.split()[1:]), e.message]
finally:
sys.stdout = old_stdout
return response
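# Sketch of handling the response dictionary returned above (illustrative):
#
# result = get_xml_from_textmarc(recid, textmarc)
# if result['resultMsg'] == 'textmarc_parsing_success':
# marcxml = result['resultXML']
# else:
# lineno, content, message = result['parse_error']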
#################### crossref utils ####################
def crossref_process_template(template, change=False):
"""
Create a record from an XML template
@param change: if set to True, makes changes to the record (translating the
title, unifying author names, etc.); if not, returns the record without
any changes
@return: record
"""
record = create_record(template)[0]
if change:
crossref_translate_title(record)
crossref_normalize_name(record)
return record
def crossref_translate_title(record):
"""
Convert the record's title to the INSPIRE-specific abbreviation
of the title (using the JOURNALS knowledge base)
@return: changed record
"""
# probably there is only one 773 field
# but just in case let's treat it as a list
for field in record_get_field_instances(record, '773'):
title = field[0][0][1]
new_title = get_kbr_values("JOURNALS", title, searchtype='e')
if new_title:
# returned value is a list, and we need only the first value
new_title = new_title[0][0]
position = field[4]
record_modify_subfield(rec=record, tag='773', subfield_code='p', \
value=new_title, subfield_position=0, field_position_global=position)
def crossref_normalize_name(record):
"""
Change the format of an author's name (often with initials) to the proper,
unified one, using the bibauthorid_name_utils tools
@return: changed record
"""
# pattern for removing the spaces between two initials
pattern_initials = '([A-Z]\\.)\\s([A-Z]\\.)'
# first, change the main author
for field in record_get_field_instances(record, '100'):
main_author = field[0][0][1]
new_author = create_normalized_name(split_name_parts(main_author))
# remove spaces between initials
# two iterations are required
for _ in range(2):
new_author = re.sub(pattern_initials, '\g<1>\g<2>', new_author)
position = field[4]
record_modify_subfield(rec=record, tag='100', subfield_code='a', \
value=new_author, subfield_position=0, field_position_global=position)
# then, change additional authors
for field in record_get_field_instances(record, '700'):
author = field[0][0][1]
new_author = create_normalized_name(split_name_parts(author))
for _ in range(2):
new_author = re.sub(pattern_initials, '\g<1>\g<2>',new_author)
position = field[4]
record_modify_subfield(rec=record, tag='700', subfield_code='a', \
value=new_author, subfield_position=0, field_position_global=position)
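# Example of the initials normalization performed above (illustrative only,
# assuming the name parser keeps the 'Surname, Initials' shape): a name like
# 'Smith, J. R.' would come out roughly as 'Smith, J.R.' after the two
# re.sub passes drop the space between initials.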
diff --git a/modules/docextract/bin/Makefile.am b/modules/docextract/bin/Makefile.am
index 33fa50cbb..5a80dcf2a 100644
--- a/modules/docextract/bin/Makefile.am
+++ b/modules/docextract/bin/Makefile.am
@@ -1,22 +1,22 @@
## This file is part of Invenio.
## Copyright (C) 2004, 2005, 2006, 2007, 2008, 2010, 2011 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
-bin_SCRIPTS = docextract refextract
+bin_SCRIPTS = docextract refextract convert_journals
-EXTRA_DIST =
+EXTRA_DIST =
CLEANFILES = *~ *.tmp
diff --git a/modules/docextract/bin/Makefile.am b/modules/docextract/bin/convert_journals.in
similarity index 67%
copy from modules/docextract/bin/Makefile.am
copy to modules/docextract/bin/convert_journals.in
index 33fa50cbb..f4bed93c6 100644
--- a/modules/docextract/bin/Makefile.am
+++ b/modules/docextract/bin/convert_journals.in
@@ -1,22 +1,32 @@
+#!@PYTHON@
+## -*- mode: python; coding: utf-8; -*-
+##
## This file is part of Invenio.
-## Copyright (C) 2004, 2005, 2006, 2007, 2008, 2010, 2011 CERN.
+## Copyright (C) 2013 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
-bin_SCRIPTS = docextract refextract
+import sys
+
+from invenio.docextract_convert_journals import cli_main, get_cli_options
-EXTRA_DIST =
-CLEANFILES = *~ *.tmp
+if __name__ == '__main__':
+ try:
+ (options, args) = get_cli_options()
+ cli_main(options, args)
+ except KeyboardInterrupt:
+ # Exit cleanly
+ print 'Interrupted'
diff --git a/modules/docextract/etc/collaborations.kb b/modules/docextract/etc/collaborations.kb
index ef3bb4f4a..df0916f15 100644
--- a/modules/docextract/etc/collaborations.kb
+++ b/modules/docextract/etc/collaborations.kb
@@ -1,30 +1,31 @@
## This file holds text which must be recognised alongside authors, and hence included in the $h subfields.
## Matches using this data do not affect how references are split.
## (It simply appends to the most recent $h subfield for the datafield, or makes a new one.)
## Do not append an 's' to the end.
## Insert only the Upper cased version.
-CMS Collaboration
-ATLAS Collaboration
-ALICE Collaboration
-LEP Collaboration
-CDF Collaboration
-D0 Collaboration
-ALEPH Collaboration
-DELPHI Collaboration
-L3 Collaboration
-OPAL Collaboration
-CTEQ Collaboration
-GEANT4 Collaboration
-LHC-B Collaboration
-CDF II Collaboration
-RD 48 Collaboration
-SLD Collaboration
-H1 Collaboration
-COMPASS Collaboration
-HERMES Collaboration
-European Muon Collaboration
-Spin Muon Collaboration
-E143 Collaboration
-Particle Data Group Collaboration
-ATLAS Inner Detector software group Collaboration
-DØ Collaboration
\ No newline at end of file
+CMS Collaboration---CMS Collaboration
+ATLAS Collaboration---ATLAS Collaboration
+ALICE Collaboration---ALICE Collaboration
+LEP Collaboration---LEP Collaboration
+CDF Collaboration---CDF Collaboration
+D0 Collaboration---D0 Collaboration
+ALEPH Collaboration---ALEPH Collaboration
+DELPHI Collaboration---DELPHI Collaboration
+L3 Collaboration---L3 Collaboration
+OPAL Collaboration---OPAL Collaboration
+CTEQ Collaboration---CTEQ Collaboration
+GEANT4 Collaboration---GEANT4 Collaboration
+LHC-B Collaboration---LHC-B Collaboration
+CDF II Collaboration---CDF II Collaboration
+RD 48 Collaboration---RD 48 Collaboration
+SLD Collaboration---SLD Collaboration
+H1 Collaboration---H1 Collaboration
+COMPASS Collaboration---COMPASS Collaboration
+HERMES Collaboration---HERMES Collaboration
+European Muon Collaboration---European Muon Collaboration
+Spin Muon Collaboration---Spin Muon Collaboration
+E143 Collaboration---E143 Collaboration
+Particle Data Group Collaboration---Particle Data Group Collaboration
+ATLAS Inner Detector software group Collaboration---ATLAS Inner Detector software group Collaboration
+DØ Collaboration---DØ Collaboration
+CUORE Collaboration---CUORE Collaboration
diff --git a/modules/docextract/etc/report-numbers.kb b/modules/docextract/etc/report-numbers.kb
index 647d41280..5163cd997 100644
--- a/modules/docextract/etc/report-numbers.kb
+++ b/modules/docextract/etc/report-numbers.kb
@@ -1,233 +1,236 @@
*****LANL*****
<s/syymm999>
<syymm999>
ACC PHYS ---acc-phys
ADAP ORG ---adap-org
ALG GEOM ---alg-geom
AO SCI ---ao-sci
AUTO FMS ---auto-fms
BAYES AN ---bayes-an
CD HG ---cd-hg
CMP LG ---cmp-lg
COMP GAS ---comp-gas
DG GA ---dg-ga
FUNCT AN ---funct-an
GR QC ---gr-qc
ARXIVHEP EX ---hep-ex
ARXIVHEP PH ---hep-ph
ARXIVHEP TH ---hep-th
LC OM ---lc-om
MTRL TH ---mtrl-th
NEURO CEL ---neuro-cel
NEURO DEV ---neuro-dev
NEURO SCI ---neuro-sci
PATT SOL ---patt-sol
*****FermiLab*****
< 9999>
< 999>
< yy 999 [AET ]>
< yyyy 999 [AET ]>
< yyyy 99>
FERMILAB CONF ---FERMILAB-Conf
FERMILAB FN ---FERMILAB-FN
FERMILAB PUB ---FERMILAB-Pub
FERMILAB TM ---FERMILAB-TM
FERMILAB DESIGN ---FERMILAB-Design
FERMILAB THESIS ---FERMILAB-Thesis
FERMILAB MASTERS---FERMILAB-Masters
*****Fermilab DØ notes*****
< 9999>
DØ NOTE---DØ-Note
*****CERN*****
< yy 999>
<syyyy 999>
ALEPH ---ALEPH
ALICE ---ALICE
ALICE INT ---ALICE-INT
ALICE NOTE ---ALICE-INT
ATL CAL ---ATL-CAL
ATL COM ---ATL-COM
ATL COM SOFT ---ATL-COM-SOFT
ATL COM PUB ---ATL-COM-DAQ
ATL COM DAQ ---ATL-COM-DAQ
ATL COM MUON ---ATL-COM-MUON
ATL COM PHYS ---ATL-COM-PHYS
TL COM PHYS ---ATL-COM-PHYS
ATL COM TILECAL ---ATL-COM-TILECAL
ATL COM LARG ---ATL-COM-LARG
-ATL CONF ---ATL-CONF
-ATLAS CONF ---ATL-CONF
ATL DAQ ---ATL-DAQ
ATL DAQ CONF ---ATL-DAQ-CONF
ATL GEN ---ATL-GEN
ATL INDET ---ATL-INDET
ATL LARG ---ATL-LARG
ATL MUON ---ATL-MUON
ATL PUB MUON ---ATL-PUB-MUON
ATL PHYS ---ATL-PHYS
ATL PHYS PUB ---ATL-PHYS-PUB
+ATL PHYSPUB ---ATL-PHYS-PUB
+ATLPHYS PUB ---ATL-PHYS-PUB
ATL PHYS INT ---ATL-PHYS-INT
+ATL PHYSINT ---ATL-PHYS-INT
+ATLPHYS INT ---ATL-PHYS-INT
ATL TECH ---ATL-TECH
ATL TILECAL ---ATL-TILECAL
ATL SOFT ---ATL-SOFT
ATL SOFT PUB ---ATL-SOFT-PUB
ATL IS EN ---ATL-IS-EN
ATL IS QA ---ATL-IS-QA
ATL LARG PUB ---ATL-LARG-PUB
ATL COM LARG ---ATL-COM-LARG
TL COM LARG ---ATL-COM-LARG
ATLCOM LARG ---ATL-COM-LARG
ATL MAGNET PUB ---ATL-MAGNET-PUB
CERN AB ---CERN-AB
CERN ALEPH ---CERN-ALEPH
CERN ALEPH PHYSIC ---CERN-ALEPH-PHYSIC
CERN ALEPH PUB ---CERN-ALEPH-PUB
CERN ALICE INT ---CERN-ALICE-INT
CERN ALICE PUB ---CERN-ALICE-PUB
CERN ALI ---CERN-ALI
CERN AS ---CERN-AS
CERN AT ---CERN-AT
CERN ATL COM CAL ---CERN-ATL-COM-CAL
CERN ATL COM DAQ ---CERN-ATL-COM-DAQ
CERN ATL COM GEN ---CERN-ATL-COM-GEN
CERN ATL COM INDET ---CERN-ATL-COM-INDET
CERN ATL COM LARG ---CERN-ATL-COM-LARG
CERN ATL COM MUON ---CERN-ATL-COM-MUON
CERN ATL COM PHYS ---CERN-ATL-COM-PHYS
CERN ATL COM TECH ---CERN-ATL-COM
CERN ATL COM TILECAL ---CERN-ATL-COM
CERN ATL DAQ ---CERN-ATL-DAQ
CERN ATL SOFT ---CERN-ATL-SOFT
CERN ATL SOFT INT ---CERN-ATL-SOFT-INT
CERN ATL SOFT PUB ---CERN-ATL-SOFT-PUB
CERN CMS ---CERN-CMS
CERN CMS CR ---CERN-CMS-CR
CERN CMS NOTE ---CERN-CMS-NOTE
CERN CN ---CERN-CN
CERN DD ---CERN-DD
CERN DELPHI ---CERN-DELPHI
CERN ECP ---CERN-ECP
CERN EF ---CERN-EF
CERN ECP ---CERN-EP
CERN EST ---CERN-EST
CERN ETT ---CERN-ETT
CERN IT ---CERN-IT
CERN LHCB ---CERN-LHCB
CERN LHCC ---CERN-LHCC
CERN LHC ---CERN-LHC
CERN LHC PHO ---CERN-LHC-PHO
CERN LHC PROJECT REPORT---CERN-LHC-Project-Report
CERN OPEN ---CERN-OPEN
CERN PPE ---CERN-PPE
CERN PS ---CERN-PS
CERN SL ---CERN-SL
CERN SPSC ---CERN-SPSC
CERN ST ---CERN-ST
CERN TH ---CERN-TH
CERN THESIS ---CERN-THESIS
CERN TIS ---CERN-TIS
CERN ATS ---CERN-ATS
CERN ---CERN
CMS CR ---CMS-CR
CMS NOTE ---CMS-NOTE
+CMS EXO ---CMS-EXO
LHCB ---LHCB
SN ATLAS ---SN-ATLAS
PAS SUSY ---CMS-PAS-SUS
CMS PAS EXO ---CMS-PAS-EXO
CMS PAS HIN ---CMS-PAS-HIN
CMS PAS QCD ---CMS-PAS-QCD
CMS PAS TOP ---CMS-PAS-TOP
CMS PAS SUS ---CMS-PAS-SUS
CMS PAS BPH ---CMS-PAS-BPH
CMS PAS SMP ---CMS-PAS-SMP
CMS PAS HIG ---CMS-PAS-HIG
CMS PAS EWK ---CMS-PAS-EWK
CMS PAS BTV ---CMS-PAS-BTV
CMS PAS FWD ---CMS-PAS-FWD
CMS PAS TRK ---CMS-PAS-TRK
CMS PAS SMP ---CMS-PAS-SMP
CMS PAS PFT ---CMS-PAS-PFT
CMS PAS MUO ---CMS-PAS-MUO
CMS PAS JME ---CMS-PAS-JME
CMS PAS EGM ---CMS-PAS-EGM
CMS PAS DIF ---CMS-PAS-DIF
ATLTILECAL PUB ---ATLTILECAL-PUB
ATLAS TECH PUB ---ATLAS-TECH-PUB
TLCOM MAGNET ---TLCOM-MAGNET
ATLLARG ---ATL-LARG
*****CERN MORE*****
< yyyy 999>
< yyyy 99>
< yyyy 9>
< yy 99>
< yy 9>
CERN LHCB ---CERN-LHCB
CERN LHCC ---CERN-LHCC
CERN PHESS ---CERN-PHESS
*****CERN EVEN MORE*****
< 9>
CMS UG TP ---CMS-UG-TP
*****CERN DIFFERENT FORMAT*****
< 9999999>
CERN GE ---CERN-GE
*****LHC*****
< 999>
< 9999>
CERN CLIC NOTE ---CERN-CLIC-Note
LHC PROJECT NOTE ---LHC-Project-Note
CERN LHC PROJECT REPORT ---CERN-LHC-Project-Report
LHC PROJECT REPORT ---CERN-LHC-Project-Report
CLIC NOTE ---CERN-CLIC-Note
ATLAS TDR ---ATL-TDR
CMS TDR ---CMS-TDR
ATC TT ID ---ATC-TT-ID
ATC TT IN ---ATC-TT-IN
LHCCP ---LHCCP
*****KEK*****
< 9999>
< yy 999>
< yyyy 999>
KEK CP ---KEK-CP
KEK INT ---KEK-Internal
KEK INTERNAL ---KEK-Internal
KEK PREPRINT ---KEK-Preprint
KEK TH ---KEK-TH
*****DESY*****
< yy 999>
< yyyy 999>
DESY ---DESY
DESY M ---DESY M
*****SLAC*****
< 999>
< 9999>
< yy 99>
SLAC AP ---SLAC-AP
SLAC PUB ---SLAC-PUB
SLAC R ---SLAC-R
SLAC TN ---SLAC-TN
SLAC WP ---SLAC-WP
diff --git a/modules/docextract/lib/Makefile.am b/modules/docextract/lib/Makefile.am
index 3ca414c01..ab6a60506 100644
--- a/modules/docextract/lib/Makefile.am
+++ b/modules/docextract/lib/Makefile.am
@@ -1,48 +1,51 @@
## This file is part of Invenio.
## Copyright (C) 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2013 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
pylibdir = $(libdir)/python/invenio
pylib_DATA = docextract_pdf.py \
docextract_text.py \
docextract_utils.py \
docextract_task.py \
docextract_webinterface.py \
docextract_webinterface_unit_tests.py \
+ docextract_record.py \
+ docextract_record_regression_tests.py \
+ docextract_convert_journals.py \
refextract.py \
refextract_task.py \
refextract_config.py \
refextract_engine.py \
refextract_re.py \
refextract_api.py \
refextract_api_unit_tests.py \
refextract_api_regression_tests.py \
refextract_text.py \
- refextract_xml.py \
+ refextract_record.py \
refextract_find.py \
refextract_tag.py \
refextract_cli.py \
refextract_kbs.py \
refextract_linker.py \
refextract_regression_tests.py \
refextract_unit_tests.py \
authorextract_re.py
EXTRA_DIST = $(pylib_DATA)
CLEANFILES = *~ *.tmp *.pyc
diff --git a/modules/docextract/lib/authorextract_re.py b/modules/docextract/lib/authorextract_re.py
index 3b28d68aa..d54e9f678 100644
--- a/modules/docextract/lib/authorextract_re.py
+++ b/modules/docextract/lib/authorextract_re.py
@@ -1,461 +1,451 @@
# -*- coding: utf-8 -*-
##
## This file is part of Invenio.
## Copyright (C) 2010, 2011 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
import re
import sys
from invenio.docextract_utils import write_message
from invenio.refextract_config import CFG_REFEXTRACT_KBS
def get_author_affiliation_numeration_str(punct=None):
"""The numeration which can be applied to author names. Numeration
is sometimes found next to authors of papers.
@return: (string), which can be compiled into a regex; identifies
numeration next to an author name.
"""
##FIXME cater for start or end numeration (ie two puncs)
## Number to look for, either general or specific
re_number = '(?:\d\d?)'
re_chained_numbers = "(?:(?:[,;]\s*%s\.?\s*))*" % re_number
## Punctuation surrounding the number, either general or specific again
if punct is None:
re_punct = "(?:[\{\(\[]?)"
else:
re_punct = re.escape(punct)
## Generic number finder (MUST NOT INCLUDE NAMED GROUPS!!!)
numeration_str = """
(?:\s*(%(punct)s)\s* ## Left numeration punctuation
(%(num)s\s* ## Core numeration item, either specific or generic
%(num_chain)s ## Extra numeration, either generic or empty
)
(?:(%(punct)s)) ## Right numeration punctuation
)""" % {'num' : re_number,
'num_chain' : re_chained_numbers,
'punct' : re_punct}
return numeration_str
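# Illustrative use of the numeration pattern above (the sample string is
# hypothetical): compiled with re.VERBOSE, it matches superscript-style
# markers such as '(1)' or '1, 2' next to an author name:
#
# re_num = re.compile(get_author_affiliation_numeration_str(), re.VERBOSE)
# re_num.search('J. Smith (1)')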
def get_initial_surname_author_pattern(incl_numeration=False):
"""Match an author name of the form: 'initial(s) surname'
Return a standard author, with a maximum of 6 initials, and a surname.
The author pattern returned will match 'Initials Surname' formats only.
The Initials MUST be uppercase, and MUST have at least a dot, hyphen or apostrophe between them.
@param incl_numeration: (boolean) Return an author pattern with optional numeration after authors.
@return (string): The 'Initials Surname' author pattern."""
# Possible inclusion of superscript numeration at the end of author names
# Will match the empty string
if incl_numeration:
append_num_re = get_author_affiliation_numeration_str() + '?'
else:
append_num_re = ""
return ur"""
(?:
(?:[A-Z]\w{2,20}\s+)? ## Optionally a first name before the initials
(?<!Volume\s) ## Initials (1-5) (cannot follow 'Volume\s')
[A-Z](?:\s*[.'’\s-]{1,3}\s*[A-Z]){0,4}[.\s-]{1,2}\s* ## separated by .,-,',etc.
(?:[A-Z]\w{2,20}\s+)? ## Optionally a first name after the initials
(?:
(?!%(invalid_prefixes)s) ## Invalid prefixes to avoid
[A-Za-z]{1,3}(?<!and)(?:(?:[’'`´-]\s?)|\s) ## The surname prefix: 1, 2 or 3
)? ## character prefixes before the surname (e.g. 'van','de')
(?!%(invalid_surnames)s) ## Invalid surnames to avoid
[A-Z] ## The surname, which must start with an upper case character
(?:[rR]\.|\w{1,20}) ## handle Jr.
(?:[\-’'`´][\w’']{1,20})? ## single hyphen allowed jan-el or Figueroa-O'Farrill
[’']? ## Optionally an ending '
%(numeration)s ## A possible number to appear after an author name, used for author extraction
(?: # Look for editor notation after the author group...
\s*,?\s* # Optionally a comma/space
%(ed)s
)?
)""" % {
'invalid_prefixes': '|'.join(invalid_prefixes),
'invalid_surnames': '|'.join(invalid_surnames),
'ed' : re_ed_notation,
'numeration' : append_num_re,
}
def get_surname_initial_author_pattern(incl_numeration=False):
"""Match an author name of the form: 'surname initial(s)'
This is sometimes the representation of the first author found inside an author group.
This author pattern is only used to find a maximum of ONE author inside an author group.
Authors of this form MUST have either a comma after the initials, or an 'and',
which denotes the presence of other authors in the author group.
@param incl_numeration: (boolean) Return an author pattern with optional numeration after authors.
@return (string): The 'Surname Initials' author pattern."""
# Possible inclusion of superscript numeration at the end of author names
# Will match the empty string
if incl_numeration:
append_num_re = get_author_affiliation_numeration_str() + '?'
else:
append_num_re = ""
return ur"""
(?:
(?:
(?!%(invalid_prefixes)s) ## Invalid prefixes to avoid
[A-Za-z]{1,3}(?<!and)(?<!in)(?:(?:[’'`´-]\s?)|\s)
)? ## The optional surname prefix:
## one 1-3 character prefix before the surname (e.g. 'van','de')
(?!%(invalid_surnames)s) ## Invalid surnames to avoid
[A-Z]\w{2,20}(?:[\-’'`´]\w{2,20})? ## The surname, which must start with an upper case character (single hyphen allowed)
\s*[,.\s]\s* ## The space between the surname and its initials
(?<!Volume\s) ## Initials
[A-Z](?:\s*[.'’\s-]{1,2}\s*[A-Z]){0,4}\.{0,2} ##
## Either a comma or an 'and' MUST be present ... OR an end of line marker
## (maybe some spaces between authors)
## Uses positive lookahead assertion
(?: # Look for editor notation after the author group...
\s*,?\s* # Optionally a comma/space
%(ed)s
)?
)""" % {
'invalid_prefixes': '|'.join(invalid_prefixes),
'invalid_surnames': '|'.join(invalid_surnames),
'ed' : re_ed_notation,
'numeration' : append_num_re,
}
invalid_surnames = (
- 'Supergravity', 'Collaboration', 'Theoretical', 'Appendix'
+ 'Supergravity', 'Collaboration', 'Theoretical', 'Appendix', 'Phys', 'Paper'
)
invalid_prefixes = (
'at',
)
def make_auth_regex_str(etal, initial_surname_author=None, surname_initial_author=None):
"""
Returns a regular expression to be used to identify groups of author names in a citation.
This method contains patterns for default authors, so no arguments are needed for the
most reliable form of matching.
The returned author pattern is capable of:
1. Identifying single authors, with at least one initial, of the form:
'Initial. [surname prefix...] Surname'
2. Identifying multiple authors, each with at least one initial, of the form:
'Initial. [surname prefix...] Surname, [and] [Initial. [surname prefix...] Surname ... ]'
***(Note that a full stop, hyphen or apostrophe after each initial is
absolutely vital in identifying authors for both of these above methods.
Initials must also be uppercase.)***
3. Capture 'et al' statements at the end of author groups (allows for authors with et al
to be processed differently from 'standard' authors)
4. Identifying a single author surname positioned before the phrase 'et al',
with no initials: 'Surname et al'
5. Identifying two author surnames positioned before the phrase 'et al',
with no initials, but separated by 'and' or '&': 'Surname [and|&] Surname et al'
6. Identifying authors of the form:
'Surname Initials, Initials Surname [Initials Surname]...'. Some authors choose
to represent the most important cited author (in a list of authors) by listing first
their surname, and then their initials. Since this form has few distinguishing
characteristics which could be used to create a reliable pattern, at least one
standard author must be present after it in order to improve the accuracy.
7. Capture editor notation, which can take many forms, e.g.
'eds. editors. edited by. etc.'. Authors captured in this way can be treated as
'editor groups', and hence processed differently if needed from standard authors
@param etal: (string) The regular expression used to identify 'etal' notation
@param initial_surname_author: (string) optional replacement for the default
'initials surname' pattern used to identify author groups
@param surname_initial_author: (string) optional replacement for the default
'surname initials' pattern used to identify author groups
@return: (string) The full author group identification regex, which will:
- detect groups of authors in a range of formats, e.g.:
C. Hayward, V van Edwards, M. J. Woodbridge, and L. Kelloggs et al.,
- detect whether the author group has been marked up as editors of the doc.
(therefore they will NOT be marked up as authors) e.g.:
ed. C Hayward | (ed) V van Edwards | ed by, M. J. Woodbridge and V van Edwards
| L. Kelloggs (editors) | M. Jackson (eds.) | ...
- detect a maximum of two surnames only if the surname(s) is followed by 'et al'
(must be separated by 'and' if there are two), e.g.:
Amaldi et al., | Hayward and Yellow et al.,
"""
if not initial_surname_author:
## Standard author, with a maximum of 6 initials, and a surname.
## The Initials MUST be uppercase, and MUST have at least a dot, hyphen or apostrophe between them.
initial_surname_author = get_initial_surname_author_pattern()
if not surname_initial_author:
## The author name of the form: 'surname initial(s)'
## This is sometimes the representation of the first author found inside an author group.
## This author pattern is only used to find a maximum of ONE author inside an author group.
## Authors of this form MUST have either a comma after the initials, or an 'and',
## which denotes the presence of other authors in the author group.
surname_initial_author = get_surname_initial_author_pattern()
## Pattern used to locate a GROUP of author names in a reference
## The format of an author can take many forms:
## J. Bloggs, W.-H. Smith, D. De Samuel, G.L. Bayetian, C. Hayward et al.,
## (the use of 'et. al' is a giveaway that the preceding
## text was indeed an author name)
## This will also match authors which seem to be labeled as editors (with the phrase 'ed.')
## In which case, the author will be thrown away later on.
## The returned regex is close to Python's limit of 100 groups per pattern, so any
## new groups must be non-capturing, i.e. started with '?:'
return ur"""
(?:^|\s+|\() ## Must be the start of the line, or a space (or an opening bracket in very few cases)
(?P<es> ## Look for editor notation before the author
(?:(?:(?:[Ee][Dd]s?|[Ee]dited|[Ee]ditors?)((?:\.\s?)|(?:\.?\s))) ## 'eds?. ' | 'ed ' | 'ed.'
|(?:(?:[Ee][Dd]s?|[Ee]dited|[Ee]ditions?)(?:(?:\.\s?)|(?:\.?\s))by(?:\s|([:,]\s))) ## 'eds?. by, ' | 'ed. by: ' | 'ed by ' | 'ed. by '| 'ed by: '
|(?:\(\s?([Ee][Dd]s?|[Ee]dited|[Ee]ditors?)(?:(?:\.\s?)|(?:\.?\s))?\))) ## '( eds?. )' | '(ed.)' | '(ed )' | '( ed )' | '(ed)'
)?
## **** (1) , one or two surnames which MUST end with 'et al' (e.g. Amaldi et al.,)
(?P<author_names>
(?:
(?:[A-Z](?:\s*[.'’-]{1,2}\s*[A-Z]){0,4}[.\s]\s*)? ## Initials
[A-Z][^0-9_\.\s]{2,20}(?:(?:[,\.]\s*)|(?:[,\.]?\s+)) ## Surname
(?:[A-Z](?:\s*[.'’-]{1,2}\s*[A-Z]){0,4}[.\s]\s*)? ## Initials
(?P<multi_surs>
(?:(?:[Aa][Nn][Dd]|\&)\s+) ## Maybe 'and' or '&' tied with another name
[A-Z][^0-9_\.\s]{3,20}(?:(?:[,\.]\s*)|(?:[,\.]?\s+)) ## More surnames
(?:[A-Z](?:[ -][A-Z])?\s+)? ## with initials
)?
(?: # Look for editor notation after the author group...
\s*[,\s]?\s* # Optionally a comma/space
%(ed)s
)?
(?P<et2>
%(etal)s ## et al, MUST BE PRESENT however, for this author form
)
(?: # Look for editor notation after the author group...
\s*[,\s]?\s* # Optionally a comma/space
%(ed)s
)?
) |
(?:
## **** (2) , The standard author form.. (e.g. J. Bloggs)
## This author form can either start with a normal 'initial surname' author,
## or it can begin with a single 'surname initial' author
(?: ## The first author in the 'author group'
%(i_s_author)s |
(?P<sur_initial_auth>%(s_i_author)s)
)
(?P<multi_auth>
(?: ## Then 0 or more author names
\s*[,\s]\s*
(?:
%(i_s_author)s | %(s_i_author)s
)
)*
(?: ## Maybe 'and' or '&' tied with another name
(?:
\s*[,\s]\s* ## handle "J. Dan, and H. Pon"
(?:[Aa][Nn][DdsS]|\&)
\s+
)
(?P<mult_auth_sub>
%(i_s_author)s | %(s_i_author)s
)
)?
)
(?P<et> # 'et al' need not be present for either of
\s*[,\s]\s*
%(etal)s # 'initial surname' or 'surname initial' authors
)?
)
)
(?P<ee>
\s*[,\s]\s*
\(?
(?:[Ee][Dd]s|[Ee]ditors)\.?
\)?
[\.\,]{0,2}
)?
# End of all author name patterns
\)? # A possible closing bracket to finish the author group
- (?=[\s,.;]) # Consolidate by checking we are not partially matching
+ (?=[\s,.;:]) # Consolidate by checking we are not partially matching
# something else
""" % { 'etal' : etal,
'i_s_author' : initial_surname_author,
's_i_author' : surname_initial_author,
'ed' : re_ed_notation }
## Finding an et. al, before author names indicates a bad match!!!
## I.e. could be a title match... ignore it
etal_matches = (
u' et al.,',
u' et. al.,',
u' et. al.',
u' et.al.,',
u' et al.',
u' et al',
)
# Editor notation: 'eds?.' | 'ed.' | 'ed'
re_ed_text = ur"(?:[Ee][Dd]|[Ee]dited|[Ee]ditor)\.?"
re_ed_notation = ur"""
(?:
\(?
%(text)s
\s?
\)?
[\.\,]{0,2}
)""" % {'text': re_ed_text}
## Standard et al ('and others') pattern for author recognition
re_etal = ur"""[Ee][Tt](?:(?:(?:,|\.)\s*)|(?:(?:,|\.)?\s+))[Aa][Ll][,\.]?[,\.]?"""
## The pattern used to identify authors inside references
re_auth = (re.compile(make_auth_regex_str(re_etal), re.VERBOSE|re.UNICODE))
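For orientation, a minimal sketch of how the compiled pattern is exercised (assumes an installed Invenio environment; the reference line is invented):

    from invenio.authorextract_re import re_auth

    line = u'C. Hayward, V. van Edwards and M. J. Woodbridge, Phys. Lett. B100 (1981) 1'
    m = re_auth.search(line)
    if m:
        # The named group covers the whole recognised author group.
        print m.group('author_names')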
## Given an Auth hit, some misc text, and then another Auth hit straight after,
## (OR a bad_and was found)
## check the entire misc text to see if it 'looks' like an author group which didn't match
## as a normal author. In which case, append it to the single author group.
## PLEASE use this pattern only against space stripped text.
## IF a bad_and was found (from above).. do re.search using this pattern
## ELIF an auth-misc-auth combo was hit, do re.match using this pattern
re_weaker_author = ur"""
## look closely for initials, and less closely at the last name.
(?:([A-Z]((\.\s?)|(\.?\s+)|(\-))){1,5}
(?:[^\s_<>0-9]+(?:(?:[,\.]\s*)|(?:[,\.]?\s+)))+)"""
## End of line MUST match, since the next string is definitely a portion of an author group (append '$')
re_auth_near_miss = re.compile(make_auth_regex_str(
re_etal, "(" + re_weaker_author + ")+$"), re.VERBOSE|re.UNICODE)
## Used as a weak mechanism to classify possible authors above identified affiliations
## (start) Firstname SurnamePrefix Surname (end)
re_ambig_auth = re.compile(u"^\s*[A-Z][^\s_<>0-9]+\s+([^\s_<>0-9]{1,3}\.?\s+)?[A-Z][^\s_<>0-9]+\s*$", \
re.UNICODE)
## Obtain the compiled expression which includes the proper author numeration
## (The pattern used to identify authors of papers)
## This pattern will match groups of authors, from the start of the line
re_auth_with_number = re.compile(make_auth_regex_str(
re_etal,
get_initial_surname_author_pattern(incl_numeration=True),
get_surname_initial_author_pattern(incl_numeration=True)
), re.VERBOSE | re.UNICODE)
## Used to obtain authors chained by connectives across multiple lines
re_comma_or_and_at_start = re.compile("^(,|((,\s*)?[Aa][Nn][Dd]|&))\s", re.UNICODE)
-def make_extra_author_regex_str():
+def make_collaborations_regex_str():
""" From the authors knowledge-base, construct a single regex holding the or'd possibilities of patterns
which should be included in $h subfields. The word 'Collaboration' is also converted to 'Coll', and
used in finding matches. Letter case is not considered during the search.
@return: (string) The single pattern built from each line in the collaborations knowledge base.
"""
def add_to_auth_list(s):
""" Strip the line, replace spaces with '\s' and append 'the' to the start
and 's' to the end. Add the prepared line to the list of extra kb authors."""
s = u"(?:the\s)?" + s.strip().replace(u' ', u'\s') + u"s?"
auths.append(s)
## Build the 'or'd regular expression of the lines in the collaborations knowledge base
auths = []
fpath = CFG_REFEXTRACT_KBS['collaborations']
try:
fh = open(fpath, "r")
except IOError:
## problem opening KB for reading, or problem while reading from it:
emsg = """Error: Could not build knowledge base containing """ \
"""author patterns - failed """ \
"""to read from KB %(kb)s.\n""" \
% {'kb' : fpath}
write_message(emsg, sys.stderr, verbose=0)
- raise IOError("Error: Unable to open author kb '%s'" % fpath)
+ raise IOError("Error: Unable to open collaborations kb '%s'" % fpath)
for line_num, rawline in enumerate(fh):
try:
rawline = rawline.decode("utf-8")
except UnicodeError:
write_message("*** Unicode problems in %s for line %d" \
% (fpath, line_num), sys.stderr, verbose=0)
- raise UnicodeError("Error: Unable to parse author kb (line: %s)" % str(line_num))
+ raise UnicodeError("Error: Unable to parse collaboration kb (line: %s)" % str(line_num))
if rawline.strip() and rawline[0].strip() != '#':
add_to_auth_list(rawline)
## Shorten collaboration to 'coll'
if rawline.lower().endswith('collaboration\n'):
coll_version = rawline[:rawline.lower().find(u'collaboration\n')] + u"coll[\.\,]"
add_to_auth_list(coll_version.strip().replace(' ', '\s') + u"s?")
author_match_re = ""
if len(auths) > 0:
author_match_re = u'|'.join([u"(?:" + a + u")" for a in auths])
author_match_re = ur"(?:(?:[\(\"]?(?P<extra_auth>" + \
author_match_re + ur")[\)\"]?[\,\.]?\s?(?:and\s)?)+)"
return author_match_re
## Create the regular expression used to find user-specified collaboration patterns
## (letter case is not considered when matching)
-re_extra_auth = re.compile(make_extra_author_regex_str(), re.IGNORECASE)
-
-
-def get_single_and_extra_author_pattern():
- """Generates a simple, one-hit-only, author name pattern, matching just one author
- name, but ALSO INCLUDING author names generated from the knowledge base. The author
- patterns are the same ones used inside the main 'author group' pattern generator.
- This function is used not for reference extraction, but for author extraction.
- @return: (string) the union of the built-in author pattern, with the kb defined
- patterns."""
- return get_single_author_pattern() + "|" + make_extra_author_regex_str()
+re_collaborations = re.compile(make_collaborations_regex_str(), re.I|re.U)
def get_single_author_pattern():
"""Generates a simple, one-hit-only, author name pattern, matching just one author
name in either of the 'S I' or 'I S' formats. The author patterns are the same
ones used inside the main 'author group' pattern generator. This function is used
not for reference extraction, but for author extraction. Numeration is appended
to author patterns by default.
@return (string): Just the author name pattern designed to identify single author names
in both SI and IS formats. (NO 'et al', editors, 'and'... matching)
@return: (string) the union of 'initial surname' and 'surname initial'
authors"""
return "(?:"+ get_initial_surname_author_pattern(incl_numeration=True) + \
"|" + get_surname_initial_author_pattern(incl_numeration=True) + ")"
## Targets single author names
-re_single_author_pattern = re.compile(get_single_and_extra_author_pattern(), re.VERBOSE)
+re_single_author_pattern = re.compile(get_single_author_pattern(), re.VERBOSE)
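A quick sketch of the single-author pattern in both supported orders (assumes an installed Invenio environment; the names are invented):

    from invenio.authorextract_re import re_single_author_pattern

    for s in (u'J. R. Ellis', u'Ellis, J. R.'):
        m = re_single_author_pattern.search(s)
        if m:
            print m.group(0)  # the matched author name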
diff --git a/modules/docextract/lib/docextract_convert_journals.py b/modules/docextract/lib/docextract_convert_journals.py
new file mode 100644
index 000000000..3e76c89c8
--- /dev/null
+++ b/modules/docextract/lib/docextract_convert_journals.py
@@ -0,0 +1,161 @@
+# -*- coding: utf-8 -*-
+##
+## This file is part of Invenio.
+## Copyright (C) 2013 CERN.
+##
+## Invenio is free software; you can redistribute it and/or
+## modify it under the terms of the GNU General Public License as
+## published by the Free Software Foundation; either version 2 of the
+## License, or (at your option) any later version.
+##
+## Invenio is distributed in the hope that it will be useful, but
+## WITHOUT ANY WARRANTY; without even the implied warranty of
+## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+## General Public License for more details.
+##
+## You should have received a copy of the GNU General Public License
+## along with Invenio; if not, write to the Free Software Foundation, Inc.,
+## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
+
+import optparse
+import sys
+
+from invenio.docextract_record import create_records, print_records
+from invenio.refextract_kbs import get_kbs
+
+from invenio.docextract_text import re_group_captured_multiple_space
+from invenio.refextract_re import re_punctuation
+
+
+DESCRIPTION = """Utility to convert journal names from abbreviations
+or full names to their short form"""
+
+HELP_MESSAGE = """
+ -o, --out Write the converted records, in XML form, to a file
+ rather than standard output.
+"""
+
+USAGE_MESSAGE = """Usage: convert_journals [options] file1 [file2 ...]
+Command options: %s
+Examples:
+ convert_journals -o /home/chayward/thesis-out.xml /home/chayward/thesis.xml
+""" % HELP_MESSAGE
+
+
+def mangle_value(kb, value):
+ value = re_punctuation.sub(u' ', value.upper())
+ value = re_group_captured_multiple_space.sub(u' ', value)
+ value = value.strip()
+
+ standardized_titles = kb[1]
+ if value in standardized_titles:
+ value = standardized_titles[value]
+
+ return value
+
+
+def mangle(kb, value):
+ try:
+ title, volume, page = value.split(',')
+ except ValueError:
+ pass
+ else:
+ value = '%s,%s,%s' % (mangle_value(kb, title), volume, page)
+ return value
+
+def convert_journals(kb, record):
+ for subfield in record.find_subfields('999C5s'):
+ subfield.value = mangle(kb, subfield.value)
+ for subfield in record.find_subfields('773__p'):
+ subfield.value = mangle_value(kb, subfield.value)
+ return record
+
+
+def convert_journals_list(kb, records):
+ return [convert_journals(kb, record) for record in records]
+
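A minimal sketch of the washing step above, using a toy knowledge base (the real one comes from get_kbs; mangle_value only relies on element [1] being a mapping from washed titles to short forms):

    from invenio.docextract_convert_journals import mangle, mangle_value

    # Toy KB: element [1] maps upper-cased, punctuation-washed titles
    # to their standardized short form.
    kb = (None, {u'TEST JOURNAL NAME': u'Test.J.Name'})
    print mangle_value(kb, u'Test Journal Name')    # Test.J.Name
    print mangle(kb, u'Test Journal Name,100,10')   # Test.J.Name,100,10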
+
+def write_records(config, records):
+ """Write marcxml to file
+
+ * Output xml header
+ * Output collection opening tag
+ * Output xml for each record
+ * Output collection closing tag
+ """
+ if config.xmlfile:
+ out = open(config.xmlfile, 'w')
+ else:
+ out = sys.stdout
+
+ xml = print_records(records)
+
+ try:
+ print >>out, xml
+ out.flush()
+ finally:
+ if config.xmlfile:
+ out.close()
+
+
+def usage(wmsg=None, err_code=0):
+ """Display a usage message for refextract on the standard error stream and
+ then exit.
+ @param wmsg: (string) some kind of brief warning message for the user.
+ @param err_code: (integer) an error code to be passed to halt,
+ which is called after the usage message has been printed.
+ @return: None.
+ """
+ # Display the help information and the warning in the stderr stream
+ # 'USAGE_MESSAGE' is global
+ if wmsg:
+ print >> sys.stderr, wmsg
+ print >> sys.stderr, USAGE_MESSAGE
+ sys.exit(err_code)
+
+
+def cli_main(options, args):
+ if options.help or not args:
+ usage()
+ return
+
+ if options.kb_journals:
+ kbs_files = {'journals': options.kb_journals}
+ else:
+ kbs_files = {}
+
+ kb = get_kbs(custom_kbs_files=kbs_files)['journals']
+
+ out_records = []
+ for path in args:
+ f = open(path)
+ try:
+ xml = f.read()
+ finally:
+ f.close()
+
+ out_records += convert_journals_list(kb, create_records(xml))
+
+ write_records(options, out_records)
+
+
+def get_cli_options():
+ """Get the various arguments and options from the command line and populate
+ a dictionary of cli_options.
+ @return: (tuple) of 2 elements. First element is a dictionary of cli
+ options and flags, set as appropriate; Second element is a list of cli
+ arguments.
+ """
+ parser = optparse.OptionParser(description=DESCRIPTION,
+ usage=USAGE_MESSAGE,
+ add_help_option=False)
+ # Display help and exit
+ parser.add_option('-h', '--help', action='store_true')
+ # Write out MARC XML references to the specified file
+ parser.add_option('-o', '--out', dest='xmlfile')
+ # Handle verbosity
+ parser.add_option('-v', '--verbose', type=int, dest='verbosity', default=0)
+ # Specify a different journals database
+ parser.add_option('--kb-journals', dest='kb_journals')
+
+ return parser.parse_args()
diff --git a/modules/docextract/lib/docextract_convert_journals_unit_tests.py b/modules/docextract/lib/docextract_convert_journals_unit_tests.py
new file mode 100644
index 000000000..415bc44ff
--- /dev/null
+++ b/modules/docextract/lib/docextract_convert_journals_unit_tests.py
@@ -0,0 +1,96 @@
+# -*- coding: utf-8 -*-
+##
+## This file is part of Invenio.
+## Copyright (C) 2013 CERN.
+##
+## Invenio is free software; you can redistribute it and/or
+## modify it under the terms of the GNU General Public License as
+## published by the Free Software Foundation; either version 2 of the
+## License, or (at your option) any later version.
+##
+## Invenio is distributed in the hope that it will be useful, but
+## WITHOUT ANY WARRANTY; without even the implied warranty of
+## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+## General Public License for more details.
+##
+## You should have received a copy of the GNU General Public License
+## along with Invenio; if not, write to the Free Software Foundation, Inc.,
+## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
+
+import os
+
+import subprocess
+from tempfile import NamedTemporaryFile, mkstemp
+
+from invenio.docextract_record import BibRecord
+from invenio.refextract_kbs import get_kbs
+from invenio.config import CFG_BINDIR, CFG_TMPDIR
+from invenio.testutils import XmlTest
+from invenio.docextract_convert_journals import USAGE_MESSAGE, convert_journals
+
+
+class ConverterTests(XmlTest):
+ def setUp(self):
+ kb = [("TEST JOURNAL NAME", "Converted")]
+ kbs_files = {'journals': kb}
+ self.kb = get_kbs(custom_kbs_files=kbs_files)['journals']
+
+ def test_simple(self):
+ record = BibRecord()
+ record.add_subfield('100__a', 'Test Journal Name')
+ record.add_subfield('773__p', 'Test Journal Name')
+ record.add_subfield('999C5s', 'Test Journal Name,100,10')
+ converted_record = convert_journals(self.kb, record)
+
+ expected_record = BibRecord()
+ expected_record.add_subfield('100__a', 'Test Journal Name')
+ expected_record.add_subfield('773__p', 'Converted')
+ expected_record.add_subfield('999C5s', 'Converted,100,10')
+
+ self.assertEqual(expected_record, converted_record)
+
+
+class ScriptTests(XmlTest):
+ def setUp(self):
+ self.bin_path = os.path.join(CFG_BINDIR, 'convert_journals')
+
+ def test_usage(self):
+ process = subprocess.Popen([self.bin_path, '-h'],
+ stderr=subprocess.PIPE,
+ stdout=subprocess.PIPE)
+ process.wait()
+ self.assert_(USAGE_MESSAGE in process.stderr.read())
+
+ def test_main(self):
+ xml = """<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="s">Test Journal Name,100,10</subfield>
+ </datafield>
+ </record>"""
+ xml_temp_file = NamedTemporaryFile(dir=CFG_TMPDIR)
+ xml_temp_file.write(xml)
+ xml_temp_file.flush()
+
+ kb = "TEST JOURNAL NAME---Converted"
+ kb_temp_file = NamedTemporaryFile(dir=CFG_TMPDIR)
+ kb_temp_file.write(kb)
+ kb_temp_file.flush()
+
+ dest_temp_fd, dest_temp_path = mkstemp(dir=CFG_TMPDIR)
+ try:
+ os.close(dest_temp_fd)
+
+ process = subprocess.Popen([self.bin_path, xml_temp_file.name,
+ '--kb', kb_temp_file.name,
+ '-o', dest_temp_path],
+ stderr=subprocess.PIPE,
+ stdout=subprocess.PIPE)
+ process.wait()
+
+ transformed_xml = open(dest_temp_path).read()
+ self.assertXmlEqual(transformed_xml, """<?xml version="1.0" encoding="UTF-8"?>
+<collection xmlns="http://www.loc.gov/MARC21/slim">
+<record><datafield ind1="C" ind2="5" tag="999"><subfield code="s">Converted,100,10</subfield></datafield></record>
+</collection>""")
+ finally:
+ os.unlink(dest_temp_path)
diff --git a/modules/docextract/lib/docextract_record.py b/modules/docextract/lib/docextract_record.py
new file mode 100644
index 000000000..5513dfb51
--- /dev/null
+++ b/modules/docextract/lib/docextract_record.py
@@ -0,0 +1,260 @@
+# -*- coding: utf-8 -*-
+##
+## This file is part of Invenio.
+## Copyright (C) 2013 CERN.
+##
+## Invenio is free software; you can redistribute it and/or
+## modify it under the terms of the GNU General Public License as
+## published by the Free Software Foundation; either version 2 of the
+## License, or (at your option) any later version.
+##
+## Invenio is distributed in the hope that it will be useful, but
+## WITHOUT ANY WARRANTY; without even the implied warranty of
+## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+## General Public License for more details.
+##
+## You should have received a copy of the GNU General Public License
+## along with Invenio; if not, write to the Free Software Foundation, Inc.,
+## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
+
+
+from operator import itemgetter
+try:
+ from xml.etree import ElementTree as ET
+except ImportError:
+ import elementtree.ElementTree as ET
+
+from invenio.search_engine import get_record as get_record_original
+from invenio.bibrecord import create_record as create_record_original, \
+ create_records as create_records_original
+
+
+def parse_tag(tag):
+ tag_code = tag[0:3]
+
+ try:
+ ind1 = tag[3]
+ except IndexError:
+ ind1 = " "
+
+ if ind1 == '_':
+ ind1 = ' '
+
+ try:
+ ind2 = tag[4]
+ except IndexError:
+ ind2 = " "
+
+ if ind2 == '_':
+ ind2 = ' '
+
+ try:
+ subfield_code = tag[5]
+ except IndexError:
+ subfield_code = None
+
+ return tag_code, ind1, ind2, subfield_code
+
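The tag strings accepted here are the usual 6-character MARC selectors; for example (a quick sketch):

    from invenio.docextract_record import parse_tag

    print parse_tag('999C5s')  # ('999', 'C', '5', 's')
    print parse_tag('773__p')  # ('773', ' ', ' ', 'p'); '_' means blank indicator
    print parse_tag('100')     # ('100', ' ', ' ', None); missing parts default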
+
+def convert_record(bibrecord):
+ def create_control_field(inst):
+ return BibRecordControlField(inst[3].decode('utf-8'))
+
+ def create_field(inst):
+ subfields = [BibRecordSubField(code, value.decode('utf-8'))
+ for code, value in inst[0]]
+ return BibRecordField(ind1=inst[1], ind2=inst[2], subfields=subfields)
+
+ record = BibRecord()
+ for tag, instances in bibrecord.iteritems():
+ if tag.startswith('00'):
+ record[tag] = [create_control_field(inst) for inst in instances]
+ else:
+ record[tag] = [create_field(inst) for inst in instances]
+
+ return record
+
+
+def get_record(recid):
+ """Fetch record from the database and loads it into a bibrecord"""
+ record = get_record_original(recid)
+ return convert_record(record)
+
+
+def create_record(xml):
+ record = create_record_original(xml)[0]
+ return convert_record(record)
+
+
+def create_records(xml):
+ return [convert_record(rec[0]) for rec in create_records_original(xml)]
+
+
+def print_records(records, encoding='utf-8'):
+ root = ET.Element('collection',
+ {'xmlns': 'http://www.loc.gov/MARC21/slim'})
+
+ for record in records:
+ root.append(record._to_element_tree())
+
+ return ET.tostring(root, encoding=encoding)
+
+
+class BibRecord(object):
+ def __init__(self, recid=None):
+ """Create an empty BibRecord object
+
+ If you specify the recid, the record will have a 001 field set
+ to the value of recid.
+ """
+ self.record = {}
+ if recid:
+ self.record['001'] = [BibRecordControlField(str(recid))]
+
+ def __setitem__(self, tag, fields):
+ self.record[tag] = fields
+
+ def __getitem__(self, tag):
+ return self.record[tag]
+
+ def __eq__(self, b):
+ if set(self.record.keys()) != set(b.record.keys()):
+ return False
+
+ for tag, fields in self.record.iteritems():
+ if set(fields) != set(b[tag]):
+ return False
+
+ return True
+
+ def __hash__(self):
+ return hash(tuple(self.record.iteritems()))
+
+ def __repr__(self):
+ if '001' in self.record:
+ s = u'BibRecord(%s)' % list(self['001'])[0].value
+ else:
+ s = u'BibRecord(fields=%s)' % repr(self.record)
+ return s
+
+ def find_subfields(self, tag):
+ tag_code, ind1, ind2, subfield_code = parse_tag(tag)
+ results = []
+ for field in self.record.get(tag_code, []):
+ if ind1 != '%' and field.ind1 != ind1:
+ continue
+
+ if ind2 != '%' and field.ind2 != ind2:
+ continue
+
+ for subfield in field.subfields:
+ if subfield_code is None or subfield.code == subfield_code:
+ results.append(subfield)
+
+ return results
+
+ def find_fields(self, tag):
+ tag_code, ind1, ind2, dummy = parse_tag(tag)
+ results = []
+ for field in self.record.get(tag_code, []):
+ if ind1 != '%' and field.ind1 != ind1:
+ continue
+
+ if ind2 != '%' and field.ind2 != ind2:
+ continue
+
+ results.append(field)
+
+ return results
+
+ def add_field(self, tag):
+ tag_code, ind1, ind2, dummy = parse_tag(tag)
+ field = BibRecordField(ind1=ind1, ind2=ind2)
+ self.record.setdefault(tag_code, []).append(field)
+ return field
+
+ def add_subfield(self, tag, value):
+ tag_code, ind1, ind2, subfield_code = parse_tag(tag)
+
+ subfield = BibRecordSubField(code=subfield_code, value=value)
+ field = BibRecordField(ind1=ind1, ind2=ind2, subfields=[subfield])
+ self.record.setdefault(tag_code, []).append(field)
+ return subfield
+
+ def _to_element_tree(self):
+ root = ET.Element('record')
+ for tag, fields in sorted(self.record.iteritems(), key=itemgetter(0)):
+ for field in fields:
+ if tag.startswith('00'):
+ controlfield = ET.SubElement(root,
+ 'controlfield',
+ {'tag': tag})
+ controlfield.text = field.value
+ else:
+ attribs = {'tag': tag,
+ 'ind1': field.ind1,
+ 'ind2': field.ind2}
+ datafield = ET.SubElement(root, 'datafield', attribs)
+ for subfield in field.subfields:
+ attrs = {'code': subfield.code}
+ s = ET.SubElement(datafield, 'subfield', attrs)
+ s.text = subfield.value
+ return root
+
+ def to_xml(self, encoding='utf-8'):
+ return ET.tostring(self._to_element_tree(), encoding=encoding)
+
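A minimal sketch of the intended API (assumes an installed Invenio environment):

    from invenio.docextract_record import BibRecord, print_records

    record = BibRecord(recid=42)  # seeds a 001 control field with '42'
    record.add_subfield('999C5s', u'Converted,100,10')
    print [s.value for s in record.find_subfields('999C5s')]
    print record.to_xml()          # a single <record> element
    print print_records([record])  # wrapped in a MARCXML <collection>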
+
+class BibRecordControlField(object):
+ def __init__(self, value):
+ self.value = value
+
+ def __eq__(self, b):
+ return self.value == b.value
+
+ def __hash__(self):
+ return hash(self.value)
+
+
+class BibRecordField(object):
+ def __init__(self, ind1=" ", ind2=" ", subfields=None):
+ self.ind1 = ind1
+ self.ind2 = ind2
+ if subfields is None:
+ subfields = []
+ self.subfields = subfields
+
+ def __repr__(self):
+ return 'BibRecordField(ind1=%s, ind2=%s, subfields=%s)' \
+ % (repr(self.ind1), repr(self.ind2), repr(self.subfields))
+
+ def __eq__(self, b):
+ return self.ind1 == b.ind1 and self.ind2 == b.ind2 \
+ and set(self.subfields) == set(b.subfields)
+
+ def __hash__(self):
+ return hash((self.ind1, self.ind2, tuple(self.subfields)))
+
+ def get_subfield_values(self, code):
+ return [s.value for s in self.subfields if s.code == code]
+
+ def add_subfield(self, code, value):
+ subfield = BibRecordSubField(code=code, value=value)
+ self.subfields.append(subfield)
+ return subfield
+
+
+class BibRecordSubField(object):
+ def __init__(self, code, value):
+ self.code = code
+ self.value = value
+
+ def __repr__(self):
+ return 'BibRecordSubField(code=%s, value=%s)' \
+ % (repr(self.code), repr(self.value))
+
+ def __eq__(self, b):
+ return self.code == b.code and self.value == b.value
+
+ def __hash__(self):
+ return hash((self.code, self.value))
diff --git a/modules/docextract/lib/docextract_record_regression_tests.py b/modules/docextract/lib/docextract_record_regression_tests.py
new file mode 100644
index 000000000..7824d3b2c
--- /dev/null
+++ b/modules/docextract/lib/docextract_record_regression_tests.py
@@ -0,0 +1,103 @@
+# -*- coding: utf-8 -*-
+##
+## This file is part of Invenio.
+## Copyright (C) 2013 CERN.
+##
+## Invenio is free software; you can redistribute it and/or
+## modify it under the terms of the GNU General Public License as
+## published by the Free Software Foundation; either version 2 of the
+## License, or (at your option) any later version.
+##
+## Invenio is distributed in the hope that it will be useful, but
+## WITHOUT ANY WARRANTY; without even the implied warranty of
+## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+## General Public License for more details.
+##
+## You should have received a copy of the GNU General Public License
+## along with Invenio; if not, write to the Free Software Foundation, Inc.,
+## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
+
+from invenio.docextract_record import BibRecord, \
+ get_record, \
+ create_record, \
+ create_records
+from invenio.search_engine import get_record as get_record_original
+from invenio.search_engine import perform_request_search
+from invenio.bibrecord import print_rec
+from invenio.testutils import XmlTest
+
+
+class BibRecordTest(XmlTest):
+ def setUp(self):
+ self.maxDiff = None
+ from invenio import bibrecord
+
+ def order_by_tag(field1, field2):
+ """Function used to order the fields according to their tag"""
+ return cmp(field1[0], field2[0])
+ bibrecord._order_by_ord = order_by_tag
+
+ self.records_cache = {}
+ self.xml_cache = {}
+ for recid in perform_request_search(p=""):
+ record = get_record(recid)
+ self.records_cache[recid] = record
+ self.xml_cache[recid] = record.to_xml()
+
+ def test_get_record(self):
+ for recid in perform_request_search(p=""):
+ # The bibrecord implementation we want to test
+ record = self.records_cache[recid]
+ # Reference implementation
+ original_record = get_record_original(recid)
+ self.assertXmlEqual(record.to_xml(), print_rec(original_record))
+
+ def test_create_record(self):
+ for dummy, xml in self.xml_cache.iteritems():
+ record = create_record(xml)
+ self.assertXmlEqual(record.to_xml(), xml)
+
+ def test_create_records(self):
+ xml = '\n'.join(self.xml_cache.itervalues())
+ records = create_records(xml)
+ for record in self.records_cache.itervalues():
+ self.assertEqual(record, records.pop(0))
+
+ def test_equality(self):
+ for recid in self.records_cache.iterkeys():
+ for recid2 in self.records_cache.iterkeys():
+ record = self.records_cache[recid]
+ xml = self.xml_cache[recid]
+ if recid == recid2:
+ record2 = get_record(recid)
+ xml2 = record2.to_xml()
+ self.assertEqual(record, record2)
+ self.assertXmlEqual(xml, xml2)
+ else:
+ record2 = self.records_cache[recid2]
+ xml2 = self.xml_cache[recid2]
+ self.assertNotEqual(record, record2)
+
+ def test_hash(self):
+ for dummy, original_record in self.records_cache.iteritems():
+ # The bibrecord implementation we want to test
+ record = BibRecord()
+
+ for tag, fields in original_record.record.iteritems():
+ record[tag] = list(set(fields))
+ self.assertEqual(set(record[tag]), set(original_record[tag]))
+
+ self.assertEqual(record, original_record)
+
+ def test_add_subfield(self):
+ xml = """<record>
+ <datafield tag="100" ind1=" " ind2=" ">
+ <subfield code="a">our title</subfield>
+ </datafield>
+ </record>"""
+ expected_record = create_record(xml)
+ print expected_record
+ record = BibRecord()
+ record.add_subfield('100__a', 'our title')
+ print record
+ self.assertEqual(record, expected_record)
diff --git a/modules/docextract/lib/docextract_task.py b/modules/docextract/lib/docextract_task.py
index fc522e4ab..6e7dd1425 100644
--- a/modules/docextract/lib/docextract_task.py
+++ b/modules/docextract/lib/docextract_task.py
@@ -1,205 +1,205 @@
# -*- coding: utf-8 -*-
##
## This file is part of Invenio.
## Copyright (C) 2011, 2012 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
"""Generic Framework for extracting metadata from records using bibsched"""
import traceback
from datetime import datetime
from itertools import chain
from invenio.bibtask import task_get_option, write_message, \
task_sleep_now_if_required, \
task_update_progress
from invenio.dbquery import run_sql
from invenio.search_engine import get_record
from invenio.search_engine import get_collection_reclist
from invenio.refextract_api import get_pdf_doc
from invenio.bibrecord import record_get_field_instances, \
field_get_subfield_values
def task_run_core_wrapper(name, core_func, extra_vars=None):
def fun():
try:
return task_run_core(name, core_func, extra_vars)
except Exception:
# Remove extra '\n'
write_message(traceback.format_exc()[:-1])
raise
return fun
def fetch_last_updated(name):
select_sql = "SELECT last_recid, last_updated FROM xtrJOB" \
" WHERE name = %s LIMIT 1"
row = run_sql(select_sql, (name,))
if not row:
sql = "INSERT INTO xtrJOB (name, last_updated, last_recid) " \
"VALUES (%s, '1970-01-01', 0)"
run_sql(sql, (name,))
row = run_sql(select_sql, (name,))
# Fallback in case we receive None instead of a valid date
last_recid = row[0][0] or 0
last_date = row[0][1] or datetime(year=1, month=1, day=1)
return last_recid, last_date
def store_last_updated(recid, creation_date, name):
sql = "UPDATE xtrJOB SET last_recid = %s WHERE name=%s AND last_recid < %s"
run_sql(sql, (recid, name, recid))
sql = "UPDATE xtrJOB SET last_updated = %s " \
"WHERE name=%s AND last_updated < %s"
iso_date = creation_date.isoformat()
run_sql(sql, (iso_date, name, iso_date))
def fetch_concerned_records(name):
task_update_progress("Fetching record ids")
last_recid, last_date = fetch_last_updated(name)
if task_get_option('new'):
# Fetch all records inserted since last run
sql = "SELECT `id`, `creation_date` FROM `bibrec` " \
"WHERE `creation_date` >= %s " \
"AND `id` > %s " \
"ORDER BY `creation_date`"
records = run_sql(sql, (last_date.isoformat(), last_recid))
elif task_get_option('modified'):
# Fetch all records modified since last run
sql = "SELECT `id`, `modification_date` FROM `bibrec` " \
"WHERE `modification_date` >= %s " \
"AND `id` > %s " \
"ORDER BY `modification_date`"
records = run_sql(sql, (last_date.isoformat(), last_recid))
else:
given_recids = task_get_option('recids')
for collection in task_get_option('collections'):
given_recids.add(get_collection_reclist(collection))
if given_recids:
format_strings = ','.join(['%s'] * len(given_recids))
records = run_sql("SELECT `id`, NULL FROM `bibrec` " \
"WHERE `id` IN (%s) ORDER BY `id`" % format_strings,
list(given_recids))
else:
records = []
task_update_progress("Done fetching record ids")
return records
def fetch_concerned_arxiv_records(name):
task_update_progress("Fetching arxiv record ids")
dummy, last_date = fetch_last_updated(name)
# Fetch all records inserted since last run
sql = "SELECT `id`, `modification_date` FROM `bibrec` " \
"WHERE `modification_date` >= %s " \
"AND `creation_date` > NOW() - INTERVAL 7 DAY " \
"ORDER BY `modification_date`" \
"LIMIT 5000"
records = run_sql(sql, [last_date.isoformat()])
def check_arxiv(recid):
record = get_record(recid)
for report_tag in record_get_field_instances(record, "037"):
for category in field_get_subfield_values(report_tag, 'a'):
if category.startswith('arXiv'):
return True
return False
def check_pdf_date(recid):
doc = get_pdf_doc(recid)
if doc:
return doc.md > last_date
return False
records = [(r, mod_date) for r, mod_date in records if check_arxiv(r)]
records = [(r, mod_date) for r, mod_date in records if check_pdf_date(r)]
write_message("recids %s" % repr([(r, mod_date.isoformat()) \
for r, mod_date in records]))
task_update_progress("Done fetching arxiv record ids")
return records
def process_records(name, records, func, extra_vars):
count = 1
total = len(records)
for recid, date in records:
task_sleep_now_if_required(can_stop_too=True)
msg = "Extracting for %s (%d/%d)" % (recid, count, total)
task_update_progress(msg)
write_message(msg)
func(recid, **extra_vars)
if date:
store_last_updated(recid, date, name)
count += 1
def task_run_core(name, func, extra_vars=None):
"""Calls extract_references in refextract"""
if task_get_option('task_specific_name'):
name = "%s:%s" % (name, task_get_option('task_specific_name'))
write_message("Starting %s" % name)
- if not extra_vars:
+ if extra_vars is None:
extra_vars = {}
records = fetch_concerned_records(name)
process_records(name, records, func, extra_vars)
if task_get_option('arxiv'):
extra_vars['_arxiv'] = True
arxiv_name = "%s:arxiv" % name
records = fetch_concerned_arxiv_records(arxiv_name)
process_records(arxiv_name, records, func, extra_vars)
write_message("Complete")
return True
def split_ids(value):
"""
Split ids given in the command line
Possible formats are:
* 1
* 1,2,3,4
* 1-5,20,30,40
Returns respectively
* set([1])
* set([1,2,3,4])
* set([1,2,3,4,5,20,30,40])
"""
def parse(el):
el = el.strip()
if not el:
ret = []
elif '-' in el:
start, end = el.split('-', 1)
ret = xrange(int(start), int(end) + 1)
else:
ret = [int(el)]
return ret
return chain(*(parse(c) for c in value.split(',') if c.strip()))
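For example (a quick sketch):

    from invenio.docextract_task import split_ids

    print list(split_ids('1-3,10'))  # [1, 2, 3, 10]
    print list(split_ids('7'))       # [7]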
diff --git a/modules/docextract/lib/docextract_text.py b/modules/docextract/lib/docextract_text.py
index f4129df45..e7a9988da 100644
--- a/modules/docextract/lib/docextract_text.py
+++ b/modules/docextract/lib/docextract_text.py
@@ -1,444 +1,441 @@
# -*- coding: utf-8 -*-
##
## This file is part of Invenio.
## Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010, 2011 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
"""Various utilities to manipulate or clean text"""
import re
re_space_comma = re.compile(ur'\s,', re.UNICODE)
re_space_semicolon = re.compile(ur'\s;', re.UNICODE)
re_space_period = re.compile(ur'\s\.', re.UNICODE)
re_colon_space_colon = re.compile(ur':\s:', re.UNICODE)
re_comma_space_colon = re.compile(ur',\s:', re.UNICODE)
re_space_closing_square_bracket = re.compile(ur'\s\]', re.UNICODE)
re_opening_square_bracket_space = re.compile(ur'\[\s', re.UNICODE)
re_hyphens = re.compile(
ur'(\\255|\u02D7|\u0335|\u0336|\u2212|\u002D|\uFE63|\uFF0D)', re.UNICODE)
-re_colon_not_followed_by_numeration_tag = \
- re.compile(ur':(?!\s*<cds)', re.UNICODE|re.I)
re_multiple_space = re.compile(ur'\s{2,}', re.UNICODE)
re_group_captured_multiple_space = re.compile(ur'(\s{2,})', re.UNICODE)
def get_url_repair_patterns():
"""Initialise and return a list of precompiled regexp patterns that
are used to try to re-assemble URLs that have been broken during
a document's conversion to plain-text.
@return: (list) of compiled re regexp patterns used for finding
various broken URLs.
"""
file_types_list = [
ur'h\s*t\s*m', # htm
ur'h\s*t\s*m\s*l', # html
ur't\s*x\s*t', # txt
ur'p\s*h\s*p', # php
ur'a\s*s\s*p\s*', # asp
ur'j\s*s\s*p', # jsp
ur'p\s*y', # py (python)
ur'p\s*l', # pl (perl)
ur'x\s*m\s*l', # xml
ur'j\s*p\s*g', # jpg
ur'g\s*i\s*f', # gif
ur'm\s*o\s*v', # mov
ur's\s*w\s*f', # swf
ur'p\s*d\s*f', # pdf
ur'p\s*s', # ps
ur'd\s*o\s*c', # doc
ur't\s*e\s*x', # tex
ur's\s*h\s*t\s*m\s*l', # shtml
]
pattern_list = [
ur'(h\s*t\s*t\s*p\s*\:\s*\/\s*\/)',
ur'(f\s*t\s*p\s*\:\s*\/\s*\/\s*)',
ur'((http|ftp):\/\/\s*[\w\d])',
ur'((http|ftp):\/\/([\w\d\s\._\-])+?\s*\/)',
ur'((http|ftp):\/\/([\w\d\_\.\-])+\/(([\w\d\_\s\.\-])+?\/)+)',
ur'((http|ftp):\/\/([\w\d\_\.\-])+\/(([\w\d\_\s\.\-])+?\/)*([\w\d\_\s\-]+\.\s?[\w\d]+))',
]
pattern_list = [re.compile(p, re.I|re.UNICODE) for p in pattern_list]
## some possible endings for URLs:
p = ur'((http|ftp):\/\/([\w\d\_\.\-])+\/(([\w\d\_\.\-])+?\/)*([\w\d\_\-]+\.%s))'
for extension in file_types_list:
p_url = re.compile(p % extension, re.I|re.UNICODE)
pattern_list.append(p_url)
## if the URL is the last thing in the line, and only 10 letters max, concatenate them
p_url = re.compile(
r'((http|ftp):\/\/([\w\d\_\.\-])+\/(([\w\d\_\.\-])+?\/)*\s*?([\w\d\_\.\-]\s?){1,10}\s*)$',
re.I|re.UNICODE)
pattern_list.append(p_url)
return pattern_list
## a list of patterns used to try to repair broken URLs within reference lines:
re_list_url_repair_patterns = get_url_repair_patterns()
def join_lines(line1, line2):
"""Join 2 lines of text
>>> join_lines('abc', 'de')
'abcde'
>>> join_lines('a-', 'b')
'ab'
"""
if line1 == u"":
pass
elif line1[-1] == u'-':
## hyphenated word at the end of the
## line - don't add in a space and remove hyphen
line1 = line1[:-1]
elif line1[-1] != u' ':
## no space at the end of this
## line, add in a space
line1 = line1 + u' '
return line1 + line2
def repair_broken_urls(line):
"""Attempt to repair broken URLs in a line of text.
E.g.: remove spaces from the middle of a URL.
@param line: (string) the line in which to check for broken URLs.
@return: (string) the line after any broken URLs have been repaired.
"""
def _chop_spaces_in_url_match(m):
"""Suppresses spaces in a matched URL."""
return m.group(1).replace(" ", "")
for ptn in re_list_url_repair_patterns:
line = ptn.sub(_chop_spaces_in_url_match, line)
return line
def remove_and_record_multiple_spaces_in_line(line):
"""For a given string, locate all ocurrences of multiple spaces
together in the line, record the number of spaces found at each
position, and replace them with a single space.
@param line: (string) the text line to be processed for multiple
spaces.
@return: (tuple) containing a dictionary and a string. The
dictionary contains information about the number of spaces removed
at given positions in the line. For example, if 3 spaces were
removed from the line at index '22', the dictionary would be set
as follows: { 22 : 3 }
The string that is also returned in this tuple is the line after
multiple-space occurrences have been replaced with single spaces.
"""
removed_spaces = {}
# get a collection of match objects for all instances of
# multiple-spaces found in the line:
multispace_matches = re_group_captured_multiple_space.finditer(line)
# record the number of spaces found at each match position:
for multispace in multispace_matches:
removed_spaces[multispace.start()] = \
(multispace.end() - multispace.start() - 1)
# now remove the multiple-spaces from the line, replacing with a
# single space at each position:
line = re_group_captured_multiple_space.sub(u' ', line)
return (removed_spaces, line)
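For example (a quick sketch):

    from invenio.docextract_text import remove_and_record_multiple_spaces_in_line

    # The two spaces at index 1 collapse to one; the dictionary records
    # that one extra space was removed at that position.
    print remove_and_record_multiple_spaces_in_line(u'a  b c')
    # ({1: 1}, u'a b c')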
def wash_line(line):
"""Wash a text line of certain punctuation errors, replacing them with
more correct alternatives. E.g.: the string 'Yes , I like python.'
will be transformed into 'Yes, I like python.'
@param line: (string) the line to be washed.
@return: (string) the washed line.
"""
line = re_space_comma.sub(',', line)
line = re_space_semicolon.sub(';', line)
line = re_space_period.sub('.', line)
line = re_colon_space_colon.sub(':', line)
line = re_comma_space_colon.sub(':', line)
line = re_space_closing_square_bracket.sub(']', line)
line = re_opening_square_bracket_space.sub('[', line)
line = re_hyphens.sub('-', line)
- line = re_colon_not_followed_by_numeration_tag.sub(' ', line)
line = re_multiple_space.sub(' ', line)
return line
def remove_page_boundary_lines(docbody):
"""Try to locate page breaks, headers and footers within a document body,
and remove the array cells at which they are found.
@param docbody: (list) of strings, each string being a line in the
document's body.
@return: (list) of strings. The document body, hopefully with page-
breaks, headers and footers removed. Each string in the list once more
represents a line in the document.
"""
number_head_lines = number_foot_lines = 0
## Make sure document not just full of whitespace:
if not document_contains_text(docbody):
## document contains only whitespace - cannot safely
## strip headers/footers
return docbody
## Get list of index posns of pagebreaks in document:
page_break_posns = get_page_break_positions(docbody)
## Get num lines making up each header if poss:
number_head_lines = get_number_header_lines(docbody, page_break_posns)
## Get num lines making up each footer if poss:
number_foot_lines = get_number_footer_lines(docbody, page_break_posns)
## Remove pagebreaks,headers,footers:
docbody = strip_headers_footers_pagebreaks(docbody, \
page_break_posns, \
number_head_lines, \
number_foot_lines)
return docbody
def document_contains_text(docbody):
"""Test whether document contains text, or is just full of worthless
whitespace.
@param docbody: (list) of strings - each string being a line of the
document's body
@return: (integer) 1 if non-whitespace found in document; 0 if only
whitespace found in document.
"""
found_non_space = 0
for line in docbody:
if not line.isspace():
## found a non-whitespace character in this line
found_non_space = 1
break
return found_non_space
def get_page_break_positions(docbody):
"""Locate page breaks in the list of document lines and create a list
positions in the document body list.
@param docbody: (list) of strings - each string is a line in the
document.
@return: (list) of integer positions, whereby each integer represents the
position (in the document body) of a page-break.
"""
page_break_posns = []
p_break = re.compile(ur'^\s*\f\s*$', re.UNICODE)
num_document_lines = len(docbody)
for i in xrange(num_document_lines):
if p_break.match(docbody[i]) != None:
page_break_posns.append(i)
return page_break_posns
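For example (a quick sketch; '\f' is the form-feed character that tools such as pdftotext emit between pages):

    from invenio.docextract_text import get_page_break_positions

    docbody = [u'last line of page 1', u'\f', u'first line of page 2']
    print get_page_break_positions(docbody)  # [1]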
def get_number_header_lines(docbody, page_break_posns):
"""Try to guess the number of header lines each page of a document has.
The positions of the page breaks in the document are used to try to guess
the number of header lines.
@param docbody: (list) of strings - each string being a line in the
document
@param page_break_posns: (list) of integers - each integer is the
position of a page break in the document.
@return: (int) the number of lines that make up the header of each page.
"""
remaining_breaks = len(page_break_posns) - 1
num_header_lines = empty_line = 0
## pattern to search for a word in a line:
p_wordSearch = re.compile(ur'([A-Za-z0-9-]+)', re.UNICODE)
if remaining_breaks > 2:
if remaining_breaks > 3:
# Only check odd page headers
next_head = 2
else:
# Check headers on each page
next_head = 1
keep_checking = 1
while keep_checking:
cur_break = 1
if docbody[(page_break_posns[cur_break] \
+ num_header_lines + 1)].isspace():
## this is a blank line
empty_line = 1
if (page_break_posns[cur_break] + num_header_lines + 1) \
== (page_break_posns[(cur_break + 1)]):
## Have reached next page-break: document has no
## body - only head/footers!
keep_checking = 0
grps_headLineWords = \
p_wordSearch.findall(docbody[(page_break_posns[cur_break] \
+ num_header_lines + 1)])
cur_break = cur_break + next_head
while (cur_break < remaining_breaks) and keep_checking:
grps_thisLineWords = \
p_wordSearch.findall(docbody[(page_break_posns[cur_break] \
+ num_header_lines + 1)])
if empty_line:
if len(grps_thisLineWords) != 0:
## This line should be empty, but isn't
keep_checking = 0
else:
if (len(grps_thisLineWords) == 0) or \
(len(grps_headLineWords) != len(grps_thisLineWords)):
## Not same num 'words' as equivalent line
## in 1st header:
keep_checking = 0
else:
keep_checking = \
check_boundary_lines_similar(grps_headLineWords, \
grps_thisLineWords)
## Update cur_break for nxt line to check
cur_break = cur_break + next_head
if keep_checking:
## Line is a header line: check next
num_header_lines = num_header_lines + 1
empty_line = 0
return num_header_lines
def get_number_footer_lines(docbody, page_break_posns):
"""Try to guess the number of footer lines each page of a document has.
The positions of the page breaks in the document are used to try to guess
the number of footer lines.
@param docbody: (list) of strings - each string being a line in the
document
@param page_break_posns: (list) of integers - each integer is the
position of a page break in the document.
@return: (int) the number of lines that make up the footer of each page.
"""
num_breaks = len(page_break_posns)
num_footer_lines = 0
empty_line = 0
keep_checking = 1
p_wordSearch = re.compile(unicode(r'([A-Za-z0-9-]+)'), re.UNICODE)
if num_breaks > 2:
while keep_checking:
cur_break = 1
if page_break_posns[cur_break] - num_footer_lines - 1 < 0 or \
page_break_posns[cur_break] - num_footer_lines - 1 > \
len(docbody) - 1:
## Be sure that the docbody list boundary wasn't overstepped:
break
if docbody[(page_break_posns[cur_break] \
- num_footer_lines - 1)].isspace():
empty_line = 1
grps_headLineWords = \
p_wordSearch.findall(docbody[(page_break_posns[cur_break] \
- num_footer_lines - 1)])
cur_break = cur_break + 1
while (cur_break < num_breaks) and keep_checking:
grps_thisLineWords = \
p_wordSearch.findall(docbody[(page_break_posns[cur_break] \
- num_footer_lines - 1)])
if empty_line:
if len(grps_thisLineWords) != 0:
## this line should be empty, but isn't
keep_checking = 0
else:
if (len(grps_thisLineWords) == 0) or \
(len(grps_headLineWords) != len(grps_thisLineWords)):
## Not same num 'words' as equivalent line
## in 1st footer:
keep_checking = 0
else:
keep_checking = \
check_boundary_lines_similar(grps_headLineWords, \
grps_thisLineWords)
## Update cur_break for nxt line to check
cur_break = cur_break + 1
if keep_checking:
## Line is a footer line: check next
num_footer_lines = num_footer_lines + 1
empty_line = 0
return num_footer_lines
def strip_headers_footers_pagebreaks(docbody,
page_break_posns,
num_head_lines,
num_foot_lines):
"""Remove page-break lines, header lines, and footer lines from the
document.
@param docbody: (list) of strings, whereby each string in the list is a
line in the document.
@param page_break_posns: (list) of integers, whereby each integer
represents the index in docbody at which a page-break is found.
@param num_head_lines: (int) the number of header lines each page in the
document has.
@param num_foot_lines: (int) the number of footer lines each page in the
document has.
@return: (list) of strings - the document body after the headers,
footers, and page-break lines have been stripped from the list.
"""
num_breaks = len(page_break_posns)
page_lens = []
for x in xrange(0, num_breaks):
if x < num_breaks - 1:
page_lens.append(page_break_posns[x + 1] - page_break_posns[x])
page_lens.sort()
if (len(page_lens) > 0) and \
(num_head_lines + num_foot_lines + 1 < page_lens[0]):
## Safe to chop hdrs & ftrs
page_break_posns.reverse()
first = 1
for i in xrange(0, len(page_break_posns)):
## Unless this is the last page break, chop headers
if not first:
for dummy in xrange(1, num_head_lines + 1):
docbody[page_break_posns[i] \
+ 1:page_break_posns[i] + 2] = []
else:
first = 0
## Chop page break itself
docbody[page_break_posns[i]:page_break_posns[i] + 1] = []
## Chop footers (unless this is the first page break)
if i != len(page_break_posns) - 1:
for dummy in xrange(1, num_foot_lines + 1):
docbody[page_break_posns[i] \
- num_foot_lines:page_break_posns[i] \
- num_foot_lines + 1] = []
return docbody
def check_boundary_lines_similar(l_1, l_2):
"""Compare two lists to see if their elements are roughly the same.
@param l_1: (list) of strings.
@param l_2: (list) of strings.
@return: (int) 1/0.
"""
num_matches = 0
if (type(l_1) != list) or (type(l_2) != list) or (len(l_1) != len(l_2)):
## these 'boundaries' are not similar
return 0
num_elements = len(l_1)
for i in xrange(0, num_elements):
if l_1[i].isdigit() and l_2[i].isdigit():
## both lines are integers
num_matches += 1
else:
l1_str = l_1[i].lower()
l2_str = l_2[i].lower()
if (l1_str[0] == l2_str[0]) and \
(l1_str[len(l1_str) - 1] == l2_str[len(l2_str) - 1]):
num_matches = num_matches + 1
if (len(l_1) == 0) or (float(num_matches) / float(len(l_1)) < 0.9):
return 0
else:
return 1
diff --git a/modules/docextract/lib/docextract_utils.py b/modules/docextract/lib/docextract_utils.py
index f4ac313ac..a82278a56 100644
--- a/modules/docextract/lib/docextract_utils.py
+++ b/modules/docextract/lib/docextract_utils.py
@@ -1,45 +1,47 @@
# -*- coding: utf-8 -*-
##
## This file is part of Invenio.
## Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010, 2011 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
VERBOSITY = None
+USE_BIBTASK = False
import sys
from datetime import datetime
from invenio.bibtask import write_message as bibtask_write_message
-def setup_loggers(verbosity):
- global VERBOSITY
+def setup_loggers(verbosity, use_bibtask=False):
+ global VERBOSITY, USE_BIBTASK
if verbosity > 8:
print 'Setting up loggers: verbosity=%s' % verbosity
VERBOSITY = verbosity
+ USE_BIBTASK = use_bibtask
def write_message(msg, stream=sys.stdout, verbose=1):
"""Write message and flush output stream (may be sys.stdout or sys.stderr).
Useful for debugging stuff."""
- if VERBOSITY is None:
+ if USE_BIBTASK:
return bibtask_write_message(msg, stream, verbose)
- elif msg and VERBOSITY >= verbose:
+ elif VERBOSITY and msg and VERBOSITY >= verbose:
if VERBOSITY > 8:
print >>stream, datetime.now().strftime('[%H:%M:%S] '),
print >>stream, msg
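# Editor's note: an illustrative sketch (not part of the patch) of the
# reworked logging behaviour; messages only pass the verbosity gate once
# setup_loggers() has been called, and use_bibtask=True delegates to bibtask:
#
#   setup_loggers(verbosity=9)
#   write_message('parsing line', verbose=1)  # printed, with timestamp (9 > 8)
#   setup_loggers(verbosity=0)
#   write_message('parsing line', verbose=1)  # suppressed (VERBOSITY is falsy)
#   setup_loggers(verbosity=1, use_bibtask=True)
#   write_message('parsing line')             # routed to bibtask_write_message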
diff --git a/modules/docextract/lib/docextract_webinterface.py b/modules/docextract/lib/docextract_webinterface.py
index dfd1b7822..4336ea0f1 100644
--- a/modules/docextract/lib/docextract_webinterface.py
+++ b/modules/docextract/lib/docextract_webinterface.py
@@ -1,198 +1,198 @@
# -*- coding: utf-8 -*-
##
## This file is part of Invenio.
## Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010, 2011 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
"""DocExtract REST and Web API
Exposes document extraction facilities to the world
"""
from tempfile import NamedTemporaryFile
from invenio.webinterface_handler import WebInterfaceDirectory
from invenio.webuser import collect_user_info
from invenio.webpage import page
from invenio.config import CFG_TMPSHAREDDIR, CFG_ETCDIR
from invenio.refextract_api import extract_references_from_file_xml, \
extract_references_from_url_xml, \
extract_references_from_string_xml
from invenio.bibformat_engine import format_record
def check_login(req):
"""Check that the user is logged in"""
user_info = collect_user_info(req)
if user_info['email'] == 'guest':
# 1. User is guest: must login prior to upload
# return 'Please login before uploading file.'
pass
def check_url(url):
"""Check that the url we received is not gibberish"""
return url.startswith('http://') or \
url.startswith('https://') or \
url.startswith('ftp://')
def extract_from_pdf_string(pdf):
"""Extract references from a pdf stored in a string
Given a string representing a pdf, this function writes the string to
disk and passes it to refextract.
    We need to create a temporary file because we need to run pdf2text on it"""
# Save new record to file
tf = NamedTemporaryFile(prefix='docextract-pdf',
dir=CFG_TMPSHAREDDIR)
try:
tf.write(pdf)
tf.flush()
refs = extract_references_from_file_xml(tf.name)
finally:
# Also deletes the file
tf.close()
return refs
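# Editor's note: illustrative usage (not part of the patch), as in
# extract_references_pdf() below; 'paper.pdf' is a hypothetical path:
#
#   refs_xml = extract_from_pdf_string(open('paper.pdf', 'rb').read())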
def make_arxiv_url(arxiv_id):
"""Make a url we can use to download a pdf from arxiv
Arguments:
arxiv_id -- the arxiv id of the record to link to
"""
return "http://arxiv.org/pdf/%s.pdf" % arxiv_id
class WebInterfaceAPIDocExtract(WebInterfaceDirectory):
"""DocExtract REST API"""
_exports = [
('extract-references-pdf', 'extract_references_pdf'),
('extract-references-pdf-url', 'extract_references_pdf_url'),
('extract-references-txt', 'extract_references_txt'),
]
def extract_references_pdf(self, req, form):
"""Extract references from uploaded pdf"""
check_login(req)
if 'pdf' not in form:
return 'No PDF file uploaded'
return extract_from_pdf_string(form['pdf'].file.read())
def extract_references_pdf_url(self, req, form):
"""Extract references from the pdf pointed by the passed url"""
check_login(req)
if 'url' not in form:
return 'No URL specified'
url = form['url'].value
if not check_url(url):
return 'Invalid URL specified'
return extract_references_from_url_xml(url)
def extract_references_txt(self, req, form):
"""Extract references from plain text"""
check_login(req)
if 'txt' not in form:
return 'No text specified'
txt = form['txt'].value
return extract_references_from_string_xml(txt)
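# Editor's note: an illustrative client call (not part of the patch),
# mirroring the regression tests further below:
#
#   import requests
#   url = CFG_SITE_URL + '/textmining/api/extract-references-pdf-url'
#   pdf = CFG_SITE_URL + '/textmining/example.pdf'
#   response = requests.post(url, data={'url': pdf})
#   print response.content   # MARCXML with 999C5 reference datafields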
class WebInterfaceDocExtract(WebInterfaceDirectory):
"""DocExtract API"""
_exports = ['api',
- ('extract-references', 'extract_references'),
+ ('', 'extract'),
('example.pdf', 'example_pdf'),
]
api = WebInterfaceAPIDocExtract()
def example_pdf(self, req, _form):
"""Serve a test pdf for tests"""
f = open("%s/docextract/example.pdf" % CFG_ETCDIR, 'rb')
try:
req.write(f.read())
finally:
f.close()
- def extract_references_template(self):
+ def extract_template(self):
"""Template for reference extraction page"""
return """Please specify a pdf or a url or some references to parse
- <form action="extract-references" method="post"
+ <form action="" method="post"
enctype="multipart/form-data">
<p>PDF: <input type="file" name="pdf" /></p>
<p>arXiv: <input type="text" name="arxiv" /></p>
<p>URL: <input type="text" name="url" style="width: 600px;"/></p>
<textarea name="txt" style="width: 500px; height: 500px;"></textarea>
<p><input type="submit" /></p>
</form>
"""
- def extract_references(self, req, form):
+ def extract(self, req, form):
"""Refrences extraction page
This page can be used for authors to test their pdfs against our
refrences extraction process"""
user_info = collect_user_info(req)
# Handle the 3 POST parameters
if 'pdf' in form and form['pdf'].value:
pdf = form['pdf'].value
references_xml = extract_from_pdf_string(pdf)
elif 'arxiv' in form and form['arxiv'].value:
url = make_arxiv_url(arxiv_id=form['arxiv'].value)
references_xml = extract_references_from_url_xml(url)
elif 'url' in form and form['url'].value:
url = form['url'].value
references_xml = extract_references_from_url_xml(url)
elif 'txt' in form and form['txt'].value:
- txt = form['txt'].value
+ txt = form['txt'].value.decode('utf-8', errors='ignore')
references_xml = extract_references_from_string_xml(txt)
else:
references_xml = None
# If we have not uploaded anything yet
# Display the form that allows us to do so
if not references_xml:
- out = self.extract_references_template()
+ out = self.extract_template()
else:
out = """
<style type="text/css">
#referenceinp_link { display: none; }
</style>
"""
out += format_record(0,
'hdref',
- xml_record=references_xml.encode('utf-8'),
+ xml_record=references_xml,
user_info=user_info)
# Render the page (including header, footer)
return page(title='References Extractor',
body=out,
uid=user_info['uid'],
req=req)
diff --git a/modules/docextract/lib/docextract_webinterface_unit_tests.py b/modules/docextract/lib/docextract_webinterface_regression_tests.py
similarity index 98%
copy from modules/docextract/lib/docextract_webinterface_unit_tests.py
copy to modules/docextract/lib/docextract_webinterface_regression_tests.py
index 6ee535509..e5cc48e53 100644
--- a/modules/docextract/lib/docextract_webinterface_unit_tests.py
+++ b/modules/docextract/lib/docextract_webinterface_regression_tests.py
@@ -1,205 +1,205 @@
# -*- coding: utf-8 -*-
##
## This file is part of Invenio.
-## Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010, 2011 CERN.
+## Copyright (C) 2012 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
-from invenio.testutils import InvenioTestCase
+import unittest
try:
import requests
HAS_REQUESTS = True
except ImportError:
HAS_REQUESTS = False
from invenio.testutils import make_test_suite, run_test_suite
from invenio.config import CFG_SITE_URL, CFG_ETCDIR, CFG_INSPIRE_SITE
from invenio.bibrecord import create_record, record_xml_output, record_delete_field
if CFG_INSPIRE_SITE:
EXPECTED_RESPONSE = """<record>
<controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">1</subfield>
<subfield code="h">D. Clowe, A. Gonzalez, and M. Markevitch</subfield>
<subfield code="s">Astrophys. J.,604,596</subfield>
<subfield code="y">2004</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">2</subfield>
<subfield code="h">C. L. Sarazin, X-Ray Emission</subfield>
<subfield code="m">from Clusters of Galaxies (Cambridge University Press, Cambridge 1988)</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">3</subfield>
<subfield code="h">M. Girardi, G. Giuricin, F. Mardirossian, M. Mezzetti, and W. Boschin</subfield>
<subfield code="s">Astrophys. J.,505,74</subfield>
<subfield code="y">1998</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">4</subfield>
<subfield code="h">D. A. White, C. Jones, and W. Forman</subfield>
<subfield code="s">Mon. Not. R. Astron. Soc.,292,419</subfield>
<subfield code="y">1997</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">5</subfield>
<subfield code="h">V.C. Rubin, N. Thonnard, and W. K. Ford</subfield>
<subfield code="s">Astrophys. J.,238,471</subfield>
<subfield code="y">1980</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">6</subfield>
<subfield code="h">A. Bosma</subfield>
<subfield code="s">Astron. J.,86,1825</subfield>
<subfield code="y">1981</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">7</subfield>
<subfield code="h">S.M. Faber and J.S. Gallagher</subfield>
<subfield code="s">Annu. Rev. Astron. Astrophys.,17,135</subfield>
<subfield code="y">1979</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">8</subfield>
<subfield code="h">M. Persic, P. Salucci, and F. Stel</subfield>
<subfield code="s">Mon. Not. R. Astron. Soc.,281,27</subfield>
<subfield code="y">1996</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">9</subfield>
<subfield code="h">M. Lowewnstein and R. E. White</subfield>
<subfield code="s">Astrophys. J.,518,50</subfield>
<subfield code="y">1999</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">10</subfield>
<subfield code="h">D. P. Clemens</subfield>
<subfield code="s">Astrophys. J.,295,422</subfield>
<subfield code="y">1985</subfield>
</datafield>
</record>
"""
else:
EXPECTED_RESPONSE = """<record>
<controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">1</subfield>
<subfield code="h">D. Clowe, A. Gonzalez, and M. Markevitch</subfield>
<subfield code="s">Astrophys. J. 604 (2004) 596</subfield>
<subfield code="y">2004</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">2</subfield>
<subfield code="h">C. L. Sarazin, X-Ray Emission</subfield>
<subfield code="m">from Clusters of Galaxies (Cambridge University Press, Cambridge 1988)</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">3</subfield>
<subfield code="h">M. Girardi, G. Giuricin, F. Mardirossian, M. Mezzetti, and W. Boschin</subfield>
<subfield code="s">Astrophys. J. 505 (1998) 74</subfield>
<subfield code="y">1998</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">4</subfield>
<subfield code="h">D. A. White, C. Jones, and W. Forman</subfield>
<subfield code="s">Mon. Not. R. Astron. Soc. 292 (1997) 419</subfield>
<subfield code="y">1997</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">5</subfield>
<subfield code="h">V.C. Rubin, N. Thonnard, and W. K. Ford</subfield>
<subfield code="s">Astrophys. J. 238 (1980) 471</subfield>
<subfield code="y">1980</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">6</subfield>
<subfield code="h">A. Bosma</subfield>
<subfield code="s">Astron. J. 86 (1981) 1825</subfield>
<subfield code="y">1981</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">7</subfield>
<subfield code="h">S.M. Faber and J.S. Gallagher</subfield>
<subfield code="s">Annu. Rev. Astron. Astrophys. 17 (1979) 135</subfield>
<subfield code="y">1979</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">8</subfield>
<subfield code="h">M. Persic, P. Salucci, and F. Stel</subfield>
<subfield code="s">Mon. Not. R. Astron. Soc. 281 (1996) 27</subfield>
<subfield code="y">1996</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">9</subfield>
<subfield code="h">M. Lowewnstein and R. E. White</subfield>
<subfield code="s">Astrophys. J. 518 (1999) 50</subfield>
<subfield code="y">1999</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">10</subfield>
<subfield code="h">D. P. Clemens</subfield>
<subfield code="s">Astrophys. J. 295 (1985) 422</subfield>
<subfield code="y">1985</subfield>
</datafield>
</record>"""
def compare_references(test, a, b):
## Let's normalize records to remove the Invenio refextract signature
a = create_record(a)[0]
b = create_record(b)[0]
record_delete_field(a, '999', 'C', '6')
a = record_xml_output(a)
b = record_xml_output(b)
test.assertEqual(a, b)
-class DocExtractTest(InvenioTestCase):
+class DocExtractTest(unittest.TestCase):
def setUp(self):
#setup_loggers(verbosity=1)
self.maxDiff = 10000
if HAS_REQUESTS:
def test_upload(self):
url = CFG_SITE_URL + '/textmining/api/extract-references-pdf'
pdf = open("%s/docextract/example.pdf" % CFG_ETCDIR, 'rb')
response = requests.post(url, files={'pdf': pdf})
# Remove stats tag
lines = response.content.split('\n')
lines[-6:-1] = []
compare_references(self, '\n'.join(lines), EXPECTED_RESPONSE)
def test_url(self):
url = CFG_SITE_URL + '/textmining/api/extract-references-pdf-url'
pdf = CFG_SITE_URL + '/textmining/example.pdf'
response = requests.post(url, data={'url': pdf})
compare_references(self, response.content, EXPECTED_RESPONSE)
def test_txt(self):
url = CFG_SITE_URL + '/textmining/api/extract-references-txt'
pdf = open("%s/docextract/example.txt" % CFG_ETCDIR, 'rb')
response = requests.post(url, files={'txt': pdf})
# Remove stats tag
lines = response.content.split('\n')
lines[-6:-1] = []
compare_references(self, '\n'.join(lines), EXPECTED_RESPONSE)
TEST_SUITE = make_test_suite(DocExtractTest)
if __name__ == '__main__':
run_test_suite(TEST_SUITE)
diff --git a/modules/docextract/lib/docextract_webinterface_unit_tests.py b/modules/docextract/lib/docextract_webinterface_unit_tests.py
index 6ee535509..bed0de3ce 100644
--- a/modules/docextract/lib/docextract_webinterface_unit_tests.py
+++ b/modules/docextract/lib/docextract_webinterface_unit_tests.py
@@ -1,205 +1,34 @@
# -*- coding: utf-8 -*-
##
## This file is part of Invenio.
-## Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010, 2011 CERN.
+## Copyright (C) 2010, 2011, 2013 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
-from invenio.testutils import InvenioTestCase
-try:
- import requests
- HAS_REQUESTS = True
-except ImportError:
- HAS_REQUESTS = False
-from invenio.testutils import make_test_suite, run_test_suite
-from invenio.config import CFG_SITE_URL, CFG_ETCDIR, CFG_INSPIRE_SITE
-from invenio.bibrecord import create_record, record_xml_output, record_delete_field
-
-if CFG_INSPIRE_SITE:
- EXPECTED_RESPONSE = """<record>
- <controlfield tag="001">1</controlfield>
- <datafield tag="999" ind1="C" ind2="5">
- <subfield code="o">1</subfield>
- <subfield code="h">D. Clowe, A. Gonzalez, and M. Markevitch</subfield>
- <subfield code="s">Astrophys. J.,604,596</subfield>
- <subfield code="y">2004</subfield>
- </datafield>
- <datafield tag="999" ind1="C" ind2="5">
- <subfield code="o">2</subfield>
- <subfield code="h">C. L. Sarazin, X-Ray Emission</subfield>
- <subfield code="m">from Clusters of Galaxies (Cambridge University Press, Cambridge 1988)</subfield>
- </datafield>
- <datafield tag="999" ind1="C" ind2="5">
- <subfield code="o">3</subfield>
- <subfield code="h">M. Girardi, G. Giuricin, F. Mardirossian, M. Mezzetti, and W. Boschin</subfield>
- <subfield code="s">Astrophys. J.,505,74</subfield>
- <subfield code="y">1998</subfield>
- </datafield>
- <datafield tag="999" ind1="C" ind2="5">
- <subfield code="o">4</subfield>
- <subfield code="h">D. A. White, C. Jones, and W. Forman</subfield>
- <subfield code="s">Mon. Not. R. Astron. Soc.,292,419</subfield>
- <subfield code="y">1997</subfield>
- </datafield>
- <datafield tag="999" ind1="C" ind2="5">
- <subfield code="o">5</subfield>
- <subfield code="h">V.C. Rubin, N. Thonnard, and W. K. Ford</subfield>
- <subfield code="s">Astrophys. J.,238,471</subfield>
- <subfield code="y">1980</subfield>
- </datafield>
- <datafield tag="999" ind1="C" ind2="5">
- <subfield code="o">6</subfield>
- <subfield code="h">A. Bosma</subfield>
- <subfield code="s">Astron. J.,86,1825</subfield>
- <subfield code="y">1981</subfield>
- </datafield>
- <datafield tag="999" ind1="C" ind2="5">
- <subfield code="o">7</subfield>
- <subfield code="h">S.M. Faber and J.S. Gallagher</subfield>
- <subfield code="s">Annu. Rev. Astron. Astrophys.,17,135</subfield>
- <subfield code="y">1979</subfield>
- </datafield>
- <datafield tag="999" ind1="C" ind2="5">
- <subfield code="o">8</subfield>
- <subfield code="h">M. Persic, P. Salucci, and F. Stel</subfield>
- <subfield code="s">Mon. Not. R. Astron. Soc.,281,27</subfield>
- <subfield code="y">1996</subfield>
- </datafield>
- <datafield tag="999" ind1="C" ind2="5">
- <subfield code="o">9</subfield>
- <subfield code="h">M. Lowewnstein and R. E. White</subfield>
- <subfield code="s">Astrophys. J.,518,50</subfield>
- <subfield code="y">1999</subfield>
- </datafield>
- <datafield tag="999" ind1="C" ind2="5">
- <subfield code="o">10</subfield>
- <subfield code="h">D. P. Clemens</subfield>
- <subfield code="s">Astrophys. J.,295,422</subfield>
- <subfield code="y">1985</subfield>
- </datafield>
-</record>
"""
-else:
- EXPECTED_RESPONSE = """<record>
- <controlfield tag="001">1</controlfield>
- <datafield tag="999" ind1="C" ind2="5">
- <subfield code="o">1</subfield>
- <subfield code="h">D. Clowe, A. Gonzalez, and M. Markevitch</subfield>
- <subfield code="s">Astrophys. J. 604 (2004) 596</subfield>
- <subfield code="y">2004</subfield>
- </datafield>
- <datafield tag="999" ind1="C" ind2="5">
- <subfield code="o">2</subfield>
- <subfield code="h">C. L. Sarazin, X-Ray Emission</subfield>
- <subfield code="m">from Clusters of Galaxies (Cambridge University Press, Cambridge 1988)</subfield>
- </datafield>
- <datafield tag="999" ind1="C" ind2="5">
- <subfield code="o">3</subfield>
- <subfield code="h">M. Girardi, G. Giuricin, F. Mardirossian, M. Mezzetti, and W. Boschin</subfield>
- <subfield code="s">Astrophys. J. 505 (1998) 74</subfield>
- <subfield code="y">1998</subfield>
- </datafield>
- <datafield tag="999" ind1="C" ind2="5">
- <subfield code="o">4</subfield>
- <subfield code="h">D. A. White, C. Jones, and W. Forman</subfield>
- <subfield code="s">Mon. Not. R. Astron. Soc. 292 (1997) 419</subfield>
- <subfield code="y">1997</subfield>
- </datafield>
- <datafield tag="999" ind1="C" ind2="5">
- <subfield code="o">5</subfield>
- <subfield code="h">V.C. Rubin, N. Thonnard, and W. K. Ford</subfield>
- <subfield code="s">Astrophys. J. 238 (1980) 471</subfield>
- <subfield code="y">1980</subfield>
- </datafield>
- <datafield tag="999" ind1="C" ind2="5">
- <subfield code="o">6</subfield>
- <subfield code="h">A. Bosma</subfield>
- <subfield code="s">Astron. J. 86 (1981) 1825</subfield>
- <subfield code="y">1981</subfield>
- </datafield>
- <datafield tag="999" ind1="C" ind2="5">
- <subfield code="o">7</subfield>
- <subfield code="h">S.M. Faber and J.S. Gallagher</subfield>
- <subfield code="s">Annu. Rev. Astron. Astrophys. 17 (1979) 135</subfield>
- <subfield code="y">1979</subfield>
- </datafield>
- <datafield tag="999" ind1="C" ind2="5">
- <subfield code="o">8</subfield>
- <subfield code="h">M. Persic, P. Salucci, and F. Stel</subfield>
- <subfield code="s">Mon. Not. R. Astron. Soc. 281 (1996) 27</subfield>
- <subfield code="y">1996</subfield>
- </datafield>
- <datafield tag="999" ind1="C" ind2="5">
- <subfield code="o">9</subfield>
- <subfield code="h">M. Lowewnstein and R. E. White</subfield>
- <subfield code="s">Astrophys. J. 518 (1999) 50</subfield>
- <subfield code="y">1999</subfield>
- </datafield>
- <datafield tag="999" ind1="C" ind2="5">
- <subfield code="o">10</subfield>
- <subfield code="h">D. P. Clemens</subfield>
- <subfield code="s">Astrophys. J. 295 (1985) 422</subfield>
- <subfield code="y">1985</subfield>
- </datafield>
-</record>"""
-
-
-def compare_references(test, a, b):
- ## Let's normalize records to remove the Invenio refextract signature
- a = create_record(a)[0]
- b = create_record(b)[0]
- record_delete_field(a, '999', 'C', '6')
- a = record_xml_output(a)
- b = record_xml_output(b)
- test.assertEqual(a, b)
-
-
-class DocExtractTest(InvenioTestCase):
- def setUp(self):
- #setup_loggers(verbosity=1)
- self.maxDiff = 10000
-
- if HAS_REQUESTS:
- def test_upload(self):
- url = CFG_SITE_URL + '/textmining/api/extract-references-pdf'
-
- pdf = open("%s/docextract/example.pdf" % CFG_ETCDIR, 'rb')
- response = requests.post(url, files={'pdf': pdf})
- # Remove stats tag
- lines = response.content.split('\n')
- lines[-6:-1] = []
- compare_references(self, '\n'.join(lines), EXPECTED_RESPONSE)
-
- def test_url(self):
- url = CFG_SITE_URL + '/textmining/api/extract-references-pdf-url'
-
- pdf = CFG_SITE_URL + '/textmining/example.pdf'
- response = requests.post(url, data={'url': pdf})
- compare_references(self, response.content, EXPECTED_RESPONSE)
+The DocExtract web tests
+"""
- def test_txt(self):
- url = CFG_SITE_URL + '/textmining/api/extract-references-txt'
+# Note: unit tests were moved to the regression test suite. Keeping
+# this file here with empty test case set in order to overwrite any
+# previously installed file. Also, keeping TEST_SUITE empty so that
+# `inveniocfg --run-unit-tests' would not complain.
- pdf = open("%s/docextract/example.txt" % CFG_ETCDIR, 'rb')
- response = requests.post(url, files={'txt': pdf})
- # Remove stats tag
- lines = response.content.split('\n')
- lines[-6:-1] = []
- compare_references(self, '\n'.join(lines), EXPECTED_RESPONSE)
+from invenio.testutils import make_test_suite, run_test_suite
-TEST_SUITE = make_test_suite(DocExtractTest)
+TEST_SUITE = make_test_suite()
-if __name__ == '__main__':
+if __name__ == "__main__":
run_test_suite(TEST_SUITE)
diff --git a/modules/docextract/lib/refextract_api.py b/modules/docextract/lib/refextract_api.py
index 50840f8f9..564fc6ee4 100644
--- a/modules/docextract/lib/refextract_api.py
+++ b/modules/docextract/lib/refextract_api.py
@@ -1,273 +1,299 @@
# -*- coding: utf-8 -*-
##
## This file is part of Invenio.
## Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010, 2011 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
"""This is where all the public API calls are accessible
This is the only file containing public calls; everything not present
here should be considered private to the invenio modules.
"""
import os
from urllib import urlretrieve
from tempfile import mkstemp
from invenio.refextract_engine import parse_references, \
get_plaintext_document_body, \
parse_reference_line, \
get_kbs
from invenio.refextract_text import extract_references_from_fulltext
from invenio.search_engine_utils import get_fieldvalues
from invenio.bibindex_tokenizers.BibIndexJournalTokenizer import \
CFG_JOURNAL_PUBINFO_STANDARD_FORM, \
CFG_JOURNAL_TAG
from invenio.bibdocfile import BibRecDocs, InvenioBibDocFileError
from invenio.search_engine import get_record
from invenio.bibtask import task_low_level_submission
from invenio.bibrecord import record_delete_fields, record_xml_output, \
create_record, record_get_field_instances, record_add_fields, \
record_has_field
from invenio.refextract_find import get_reference_section_beginning, \
find_numeration_in_body
from invenio.refextract_text import rebuild_reference_lines
from invenio.refextract_config import CFG_REFEXTRACT_FILENAME
from invenio.config import CFG_TMPSHAREDDIR
class FullTextNotAvailable(Exception):
"""Raised when we cannot access the document text"""
class RecordHasReferences(Exception):
"""Raised when
    * we were asked to update references for a record
    * we explicitly asked not to overwrite references for this record
    (via the appropriate function argument)
    * the record already has references, so we cannot update them
"""
def extract_references_from_url_xml(url):
"""Extract references from the pdf specified in the url
    The single parameter is the url of the pdf.
It raises FullTextNotAvailable if the url gives a 404
The result is given in marcxml.
"""
filename, dummy = urlretrieve(url)
try:
try:
marcxml = extract_references_from_file_xml(filename)
except IOError, err:
if err.code == 404:
raise FullTextNotAvailable()
else:
raise
finally:
os.remove(filename)
return marcxml
-def extract_references_from_file_xml(path, recid=1):
+def extract_references_from_file_xml(path, recid=None):
"""Extract references from a local pdf file
The single parameter is the path to the file
It raises FullTextNotAvailable if the file does not exist
The result is given in marcxml.
"""
+ return extract_references_from_file(path=path, recid=recid).to_xml()
+
+
+def extract_references_from_file(path, recid=None):
+ """Extract references from a local pdf file
+
+ The single parameter is the path to the file
+ It raises FullTextNotAvailable if the file does not exist
+ The result is given as a bibrecord class.
+ """
if not os.path.isfile(path):
raise FullTextNotAvailable()
docbody, dummy = get_plaintext_document_body(path)
reflines, dummy, dummy = extract_references_from_fulltext(docbody)
if not len(reflines):
docbody, dummy = get_plaintext_document_body(path, keep_layout=True)
reflines, dummy, dummy = extract_references_from_fulltext(docbody)
return parse_references(reflines, recid=recid)
-def extract_references_from_string_xml(source, is_only_references=True):
+def extract_references_from_string_xml(source,
+ is_only_references=True,
+ recid=None):
+ """Extract references from a string
+
+ The single parameter is the document
+    The result is given in marcxml.
+ """
+ r = extract_references_from_string(source=source,
+ is_only_references=is_only_references,
+ recid=recid)
+ return r.to_xml()
+
+
+def extract_references_from_string(source,
+ is_only_references=True,
+ recid=None):
"""Extract references from a string
The single parameter is the document
    The result is given as a bibrecord class.
"""
docbody = source.split('\n')
if not is_only_references:
reflines, dummy, dummy = extract_references_from_fulltext(docbody)
else:
refs_info = get_reference_section_beginning(docbody)
if not refs_info:
refs_info, dummy = find_numeration_in_body(docbody)
refs_info['start_line'] = 0
            refs_info['end_line'] = len(docbody) - 1
reflines = rebuild_reference_lines(docbody, refs_info['marker_pattern'])
- return parse_references(reflines)
+ return parse_references(reflines, recid=recid)
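# Editor's note: an illustrative sketch (not part of the patch). The API now
# comes in pairs: the plain variants return a record object (on which
# to_xml() is called above), the *_xml variants return MARCXML directly.
# '/tmp/paper.pdf' is a hypothetical path:
#
#   record = extract_references_from_file('/tmp/paper.pdf')
#   marcxml = record.to_xml()
#   # equivalent shortcut:
#   marcxml = extract_references_from_file_xml('/tmp/paper.pdf')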
def extract_references_from_record_xml(recid):
"""Extract references from a record id
The single parameter is the document
The result is given in marcxml.
"""
path = look_for_fulltext(recid)
if not path:
raise FullTextNotAvailable()
return extract_references_from_file_xml(path, recid=recid)
def replace_references(recid):
"""Replace references for a record
The record itself is not updated, the marc xml of the document with updated
references is returned
Parameters:
* recid: the id of the record
"""
# Parse references
references_xml = extract_references_from_record_xml(recid)
- references = create_record(references_xml.encode('utf-8'))
+ references = create_record(references_xml)
# Record marc xml
record = get_record(recid)
if references[0]:
fields_to_add = record_get_field_instances(references[0],
tag='999',
ind1='%',
ind2='%')
# Replace 999 fields
record_delete_fields(record, '999')
record_add_fields(record, '999', fields_to_add)
# Update record references
out_xml = record_xml_output(record)
else:
out_xml = None
return out_xml
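# Editor's note: illustrative usage (not part of the patch). The function only
# computes the updated MARCXML; persisting it is up to the caller, e.g. via a
# bibupload correct-mode task as update_references() does below:
#
#   new_xml = replace_references(recid)
#   if new_xml is None:
#       print 'No references found'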
def update_references(recid, overwrite=True):
"""Update references for a record
First, we extract references from a record.
Then, we are not updating the record directly but adding a bibupload
task in -c mode which takes care of updating the record.
Parameters:
* recid: the id of the record
"""
if not overwrite:
# Check for references in record
record = get_record(recid)
if record and record_has_field(record, '999'):
- raise RecordHasReferences('Record has references and overwrite ' \
+ raise RecordHasReferences('Record has references and overwrite '
'mode is disabled: %s' % recid)
if get_fieldvalues(recid, '999C59'):
raise RecordHasReferences('Record has been curated: %s' % recid)
# Parse references
references_xml = extract_references_from_record_xml(recid)
# Save new record to file
(temp_fd, temp_path) = mkstemp(prefix=CFG_REFEXTRACT_FILENAME,
dir=CFG_TMPSHAREDDIR)
temp_file = os.fdopen(temp_fd, 'w')
- temp_file.write(references_xml.encode('utf-8'))
+ temp_file.write(references_xml)
temp_file.close()
# Update record
task_low_level_submission('bibupload', 'refextract', '-P', '5',
'-c', temp_path)
def list_pdfs(recid):
rec_info = BibRecDocs(recid)
docs = rec_info.list_bibdocs()
for doc in docs:
for ext in ('pdf', 'pdfa', 'PDF'):
try:
yield doc.get_file(ext)
except InvenioBibDocFileError:
pass
def get_pdf_doc(recid):
try:
doc = list_pdfs(recid).next()
except StopIteration:
doc = None
return doc
def look_for_fulltext(recid):
doc = get_pdf_doc(recid)
path = None
if doc:
path = doc.get_full_path()
return path
def record_has_fulltext(recid):
"""Checks if we can access the fulltext for the given recid"""
path = look_for_fulltext(recid)
return path is not None
def search_from_reference(text):
"""Convert a raw reference to a search query
Called by the search engine to convert a raw reference:
find rawref John, JINST 4 (1994) 45
is converted to
journal:"JINST,4,45"
"""
field = ''
pattern = ''
kbs = get_kbs()
references, dummy_m, dummy_c, dummy_co = parse_reference_line(text, kbs)
for elements in references:
for el in elements:
if el['type'] == 'JOURNAL':
field = 'journal'
pattern = CFG_JOURNAL_PUBINFO_STANDARD_FORM \
.replace(CFG_JOURNAL_TAG.replace('%', 'p'), el['title']) \
.replace(CFG_JOURNAL_TAG.replace('%', 'v'), el['volume']) \
.replace(CFG_JOURNAL_TAG.replace('%', 'c'), el['page']) \
.replace(CFG_JOURNAL_TAG.replace('%', 'y'), el['year'])
break
elif el['type'] == 'REPORTNUMBER':
field = 'report'
pattern = el['report_num']
break
return field, pattern.encode('utf-8')
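# Editor's note: illustrative usage (not part of the patch), taken from the
# docstring above; the exact pattern depends on the journals knowledge base:
#
#   field, pattern = search_from_reference('John, JINST 4 (1994) 45')
#   # field == 'journal', pattern == 'JINST,4,45'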
diff --git a/modules/docextract/lib/refextract_cli.py b/modules/docextract/lib/refextract_cli.py
index 43a58cdeb..60f9219ee 100644
--- a/modules/docextract/lib/refextract_cli.py
+++ b/modules/docextract/lib/refextract_cli.py
@@ -1,316 +1,262 @@
# -*- coding: utf-8 -*-
##
## This file is part of Invenio.
## Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010, 2011 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
"""This is file handles the command line interface
* We parse the options for both daemon and standalone usage
* When using using the standalone mode, we use the function "main"
defined here to begin the extraction of references
"""
__revision__ = "$Id$"
import traceback
import optparse
import sys
-import os
-from invenio.refextract_config import \
- CFG_REFEXTRACT_XML_VERSION, \
- CFG_REFEXTRACT_XML_COLLECTION_OPEN, \
- CFG_REFEXTRACT_XML_COLLECTION_CLOSE
+from invenio.docextract_record import print_records
from invenio.docextract_utils import write_message, setup_loggers
from invenio.bibtask import task_update_progress
-from invenio.refextract_api import extract_references_from_file_xml, \
- extract_references_from_string_xml
+from invenio.refextract_api import extract_references_from_file, \
+ extract_references_from_string
# Is refextract running standalone? (Default = no)
RUNNING_INDEPENDENTLY = False
DESCRIPTION = ""
# Help message, used by bibtask's 'task_init()' and 'usage()'
HELP_MESSAGE = """
-i, --inspire Output journal standard reference form in the INSPIRE
recognised format: [series]volume,page.
--kb-journals Manually specify the location of a journal title
knowledge-base file.
--kb-journals-re Manually specify the location of a journal title regexps
knowledge-base file.
--kb-report-numbers Manually specify the location of a report number
knowledge-base file.
--kb-authors Manually specify the location of an author
knowledge-base file.
--kb-books Manually specify the location of a book
knowledge-base file.
--no-overwrite Do not touch record if it already has references
Standalone Refextract options:
-o, --out Write the extracted references, in xml form, to a file
rather than standard output.
--dictfile Write statistics about all matched title abbreviations
(i.e. LHS terms in the titles knowledge base) to a file.
--output-raw-refs Output raw references, as extracted from the document.
No MARC XML mark-up - just each extracted line, prefixed
by the recid of the document that it came from.
--raw-references Treat the input file as pure references. i.e. skip the
stage of trying to locate the reference section within a
document and instead move to the stage of recognition
and standardisation of citations within lines.
"""
USAGE_MESSAGE = """Usage: docextract [options] file1 [file2 ...]
Command options: %s
Examples:
docextract -o /home/chayward/refs.xml /home/chayward/thesis.pdf
""" % HELP_MESSAGE
def get_cli_options():
"""Get the various arguments and options from the command line and populate
a dictionary of cli_options.
@return: (tuple) of 2 elements. First element is a dictionary of cli
options and flags, set as appropriate; Second element is a list of cli
arguments.
"""
parser = optparse.OptionParser(description=DESCRIPTION,
usage=USAGE_MESSAGE,
add_help_option=False)
# Display help and exit
parser.add_option('-h', '--help', action='store_true')
# Display version and exit
parser.add_option('-V', '--version', action='store_true')
# Output recognised journal titles in the Inspire compatible format
parser.add_option('-i', '--inspire', action='store_true')
# The location of the report number kb requested to override
# a 'configuration file'-specified kb
parser.add_option('--kb-report-numbers', dest='kb_report_numbers')
# The location of the journal title kb requested to override
# a 'configuration file'-specified kb, holding
# 'seek---replace' terms, used when matching titles in references
parser.add_option('--kb-journals', dest='kb_journals')
parser.add_option('--kb-journals-re', dest='kb_journals_re')
# The location of the author kb requested to override
parser.add_option('--kb-authors', dest='kb_authors')
# The location of the author kb requested to override
parser.add_option('--kb-books', dest='kb_books')
# The location of the author kb requested to override
parser.add_option('--kb-conferences', dest='kb_conferences')
# Write out the statistics of all titles matched during the
# extraction job to the specified file
parser.add_option('--dictfile')
# Write out MARC XML references to the specified file
parser.add_option('-o', '--out', dest='xmlfile')
# Handle verbosity
parser.add_option('-v', '--verbose', type=int, dest='verbosity', default=0)
# Output a raw list of refs
parser.add_option('--output-raw-refs', action='store_true',
dest='output_raw')
# Treat input as pure reference lines:
# (bypass the reference section lookup)
parser.add_option('--raw-references', action='store_true',
dest='treat_as_reference_section')
return parser.parse_args()
def halt(err=StandardError, msg=None, exit_code=1):
""" Stop extraction, and deal with the error in the appropriate
manner, based on whether Refextract is running in standalone or
bibsched mode.
@param err: (exception) The exception raised from an error, if any
@param msg: (string) The brief error message, either displayed
on the bibsched interface, or written to stderr.
@param exit_code: (integer) Either 0 or 1, depending on the cause
of the halting. This is only used when running standalone."""
# If refextract is running independently, exit.
# 'RUNNING_INDEPENDENTLY' is a global variable
if RUNNING_INDEPENDENTLY:
if msg:
write_message(msg, stream=sys.stderr, verbose=0)
sys.exit(exit_code)
# Else, raise an exception so Bibsched will flag this task.
else:
if msg:
# Update the status of refextract inside the Bibsched UI
task_update_progress(msg.strip())
raise err(msg)
def usage(wmsg=None, err_code=0):
"""Display a usage message for refextract on the standard error stream and
then exit.
@param wmsg: (string) some kind of brief warning message for the user.
@param err_code: (integer) an error code to be passed to halt,
which is called after the usage message has been printed.
@return: None.
"""
if wmsg:
wmsg = wmsg.strip()
# Display the help information and the warning in the stderr stream
# 'help_message' is global
print >> sys.stderr, USAGE_MESSAGE
# Output error message, either to the stderr stream also or
# on the interface. Stop the extraction procedure
halt(msg=wmsg, exit_code=err_code)
def main(config, args, run):
"""Main wrapper function for begin_extraction, and is
always accessed in a standalone/independent way. (i.e. calling main
will cause refextract to run in an independent mode)"""
# Flag as running out of bibtask
global RUNNING_INDEPENDENTLY
RUNNING_INDEPENDENTLY = True
if config.verbosity not in range(0, 10):
usage("Error: Verbosity must be an integer between 0 and 10")
setup_loggers(config.verbosity)
if config.version:
# version message and exit
write_message(__revision__, verbose=0)
halt(exit_code=0)
if config.help:
usage()
if not args:
# no files provided for reference extraction - error message
usage("Error: No valid input file specified (file1 [file2 ...])")
try:
run(config, args)
write_message("Extraction complete", verbose=2)
except StandardError, e:
# Remove extra '\n'
write_message(traceback.format_exc()[:-1], verbose=9)
write_message("Error: %s" % e, verbose=0)
halt(exit_code=1)
def extract_one(config, pdf_path):
"""Extract references from one file"""
-
- # the document body is not empty:
- # 2. If necessary, locate the reference section:
+ # If necessary, locate the reference section:
if config.treat_as_reference_section:
docbody = open(pdf_path).read().decode('utf-8')
- out = extract_references_from_string_xml(docbody)
+ record = extract_references_from_string(docbody)
else:
write_message("* processing pdffile: %s" % pdf_path, verbose=2)
- out = extract_references_from_file_xml(pdf_path)
+ record = extract_references_from_file(pdf_path)
- return out
+ return record
def begin_extraction(config, files):
"""Starts the core extraction procedure. [Entry point from main]
Only refextract_daemon calls this directly, from _task_run_core()
    @param config: contains the pre-assembled list of cli flags
and values processed by the Refextract Daemon. This is full only when
called as a scheduled bibtask inside bibsched.
"""
- # Store xml records here
- output = []
+ # Store records here
+ records = []
for num, path in enumerate(files):
# Announce the document extraction number
write_message("Extracting %d of %d" % (num + 1, len(files)),
verbose=1)
- out = extract_one(config, path)
- output.append(out)
+ # Parse references
+ rec = extract_one(config, path)
+ records.append(rec)
# Write our references
- write_references(config, output)
-
+ write_references(config, records)
-def write_references(config, xml_references):
- """Write marcxml to file
- * Output xml header
- * Output collection opening tag
- * Output xml for each record
- * Output collection closing tag
- """
+def write_references(config, records):
+ """Write in marcxml"""
if config.xmlfile:
ofilehdl = open(config.xmlfile, 'w')
else:
ofilehdl = sys.stdout
+ if config.xmlfile:
+ for rec in records:
+ for subfield in rec.find_subfields('999C5m'):
+ if len(subfield.value) > 2048:
+ subfield.value = subfield.value[:2048]
+
try:
- print >>ofilehdl, CFG_REFEXTRACT_XML_VERSION.encode("utf-8")
- print >>ofilehdl, CFG_REFEXTRACT_XML_COLLECTION_OPEN.encode("utf-8")
- for out in xml_references:
- print >>ofilehdl, out.encode("utf-8")
- print >>ofilehdl, CFG_REFEXTRACT_XML_COLLECTION_CLOSE.encode("utf-8")
+ xml = print_records(records)
+ print >>ofilehdl, xml
ofilehdl.flush()
except IOError, err:
- write_message("%s\n%s\n" % (config.xmlfile, err), \
+ write_message("%s\n%s\n" % (config.xmlfile, err),
sys.stderr, verbose=0)
- halt(err=IOError, msg="Error: Unable to write to '%s'" \
+ halt(err=IOError, msg="Error: Unable to write to '%s'"
% config.xmlfile, exit_code=1)
-
- if config.xmlfile:
- ofilehdl.close()
- # limit m tag data to something less than infinity
- limit_m_tags(config.xmlfile, 2048)
-
-
-def limit_m_tags(xml_file, length_limit):
- """Limit size of miscellaneous tags"""
- temp_xml_file = xml_file + '.temp'
- try:
- ofilehdl = open(xml_file, 'r')
- except IOError:
- write_message("***%s\n" % xml_file, verbose=0)
- raise IOError("Error: Unable to read from '%s'" % xml_file)
- try:
- nfilehdl = open(temp_xml_file, 'w')
- except IOError:
- write_message("***%s\n" % temp_xml_file, verbose=0)
- raise IOError("Error: Unable to write to '%s'" % temp_xml_file)
-
- for line in ofilehdl:
- line_dec = line.decode("utf-8")
- start_ind = line_dec.find('<subfield code="m">')
- if start_ind != -1:
- # This line is an "m" line:
- last_ind = line_dec.find('</subfield>')
- if last_ind != -1:
- # This line contains the end-tag for the "m" section
- leng = last_ind - start_ind - 19
- if leng > length_limit:
- # want to truncate on a blank to avoid problems..
- end = start_ind + 19 + length_limit
- for lett in range(end - 1, last_ind):
- xx = line_dec[lett:lett+1]
- if xx == ' ':
- break
- else:
- end += 1
- middle = line_dec[start_ind+19:end-1]
- line_dec = start_ind * ' ' + '<subfield code="m">' + \
- middle + ' !Data truncated! ' + '</subfield>\n'
- nfilehdl.write("%s" % line_dec.encode("utf-8"))
- nfilehdl.close()
- # copy back to original file name
- os.rename(temp_xml_file, xml_file)
diff --git a/modules/docextract/lib/refextract_config.py b/modules/docextract/lib/refextract_config.py
index f8b9d103c..a4dcb4ad4 100644
--- a/modules/docextract/lib/refextract_config.py
+++ b/modules/docextract/lib/refextract_config.py
@@ -1,116 +1,127 @@
# -*- coding: utf-8 -*-
##
## This file is part of Invenio.
## Copyright (C) 2005, 2006, 2007, 2008, 2010, 2011 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
"""RefExtract configuration"""
from invenio.config import CFG_VERSION, CFG_ETCDIR
# pylint: disable=C0301
+CFG_REFEXTRACT_VERSION_NUM = '1.5.32'
# Version number:
-CFG_REFEXTRACT_VERSION = "Invenio/%s refextract/%s" % (CFG_VERSION, '1.4')
+CFG_REFEXTRACT_VERSION = "Invenio/%s refextract/%s" \
+ % (CFG_VERSION, CFG_REFEXTRACT_VERSION_NUM)
# Module config directory
CFG_CONF_DIR = '%s/docextract' % CFG_ETCDIR
CFG_REFEXTRACT_KBS = {
'journals' : "%s/journal-titles.kb" % CFG_CONF_DIR,
'journals-re' : "%s/journal-titles-re.kb" % CFG_CONF_DIR,
'report-numbers' : "%s/report-numbers.kb" % CFG_CONF_DIR,
'authors' : "%s/authors.kb" % CFG_CONF_DIR,
'collaborations' : "%s/collaborations.kb" % CFG_CONF_DIR,
'books' : "%s/books.kb" % CFG_CONF_DIR,
'conferences' : "%s/conferences.kb" % CFG_CONF_DIR,
'publishers' : "%s/publishers.kb" % CFG_CONF_DIR,
'special-journals': "%s/special-journals.kb" % CFG_CONF_DIR,
}
# Prefix for temp files
CFG_REFEXTRACT_FILENAME = "refextract"
## MARC Fields and subfields used by refextract:
# Reference fields:
-CFG_REFEXTRACT_CTRL_FIELD_RECID = "001" # control-field recid
-CFG_REFEXTRACT_TAG_ID_REFERENCE = "999" # ref field tag
-CFG_REFEXTRACT_IND1_REFERENCE = "C" # ref field ind1
-CFG_REFEXTRACT_IND2_REFERENCE = "5" # ref field ind2
-CFG_REFEXTRACT_SUBFIELD_MARKER = "o" # ref marker subfield
-CFG_REFEXTRACT_SUBFIELD_MISC = "m" # ref misc subfield
-CFG_REFEXTRACT_SUBFIELD_DOI = "a" # ref DOI subfield (NEW)
-CFG_REFEXTRACT_SUBFIELD_REPORT_NUM = "r" # ref reportnum subfield
-CFG_REFEXTRACT_SUBFIELD_TITLE = "s" # ref journal subfield
-CFG_REFEXTRACT_SUBFIELD_URL = "u" # ref url subfield
-CFG_REFEXTRACT_SUBFIELD_URL_DESCR = "z" # ref url-text subfield
-CFG_REFEXTRACT_SUBFIELD_AUTH = "h" # ref author subfield
-CFG_REFEXTRACT_SUBFIELD_QUOTED = "t" # ref title subfield
-CFG_REFEXTRACT_SUBFIELD_ISBN = "i" # ref isbn subfield
-CFG_REFEXTRACT_SUBFIELD_PUBLISHER = "p" # ref publisher subfield
-CFG_REFEXTRACT_SUBFIELD_YEAR = "y" # ref publisher subfield
-CFG_REFEXTRACT_SUBFIELD_BOOK = "xbook" # ref book subfield
+CFG_REFEXTRACT_FIELDS = {
+ 'misc': 'm',
+ 'linemarker': 'o',
+ 'doi': 'a',
+ 'reportnumber': 'r',
+ 'journal': 's',
+ 'url': 'u',
+ 'urldesc': 'z',
+ 'author': 'h',
+ 'title': 't',
+ 'isbn': 'i',
+ 'publisher': 'p',
+ 'year': 'y',
+ 'collaboration': 'c',
+ 'recid': '0',
+}
+
+CFG_REFEXTRACT_TAG_ID_REFERENCE = "999" # ref field tag
+CFG_REFEXTRACT_IND1_REFERENCE = "C" # ref field ind1
+CFG_REFEXTRACT_IND2_REFERENCE = "5" # ref field ind2
## refextract statistics fields:
-CFG_REFEXTRACT_TAG_ID_EXTRACTION_STATS = "999" # ref-stats tag
-CFG_REFEXTRACT_IND1_EXTRACTION_STATS = "C" # ref-stats ind1
-CFG_REFEXTRACT_IND2_EXTRACTION_STATS = "6" # ref-stats ind2
+CFG_REFEXTRACT_TAG_ID_EXTRACTION_STATS = "999C6" # ref-stats tag
+
CFG_REFEXTRACT_SUBFIELD_EXTRACTION_STATS = "a" # ref-stats subfield
CFG_REFEXTRACT_SUBFIELD_EXTRACTION_TIME = "t" # ref-stats time subfield
CFG_REFEXTRACT_SUBFIELD_EXTRACTION_VERSION = "v" # ref-stats version subfield
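## Editor's note: an illustrative sketch (not part of the patch) of how the
## pieces above compose into the full subfield tags used elsewhere
## (e.g. '999C5m' in refextract_cli):
##
##   tag = CFG_REFEXTRACT_TAG_ID_REFERENCE + CFG_REFEXTRACT_IND1_REFERENCE + \
##         CFG_REFEXTRACT_IND2_REFERENCE + CFG_REFEXTRACT_FIELDS['author']
##   # tag == '999C5h'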
## Internal tags are used by refextract to mark-up recognised citation
-## information. These are the "closing tags:
-CFG_REFEXTRACT_MARKER_CLOSING_REPORT_NUM = r"</cds.REPORTNUMBER>"
-CFG_REFEXTRACT_MARKER_CLOSING_TITLE = r"</cds.JOURNAL>"
-CFG_REFEXTRACT_MARKER_CLOSING_TITLE_IBID = r"</cds.JOURNALibid>"
-CFG_REFEXTRACT_MARKER_CLOSING_SERIES = r"</cds.SER>"
-CFG_REFEXTRACT_MARKER_CLOSING_VOLUME = r"</cds.VOL>"
-CFG_REFEXTRACT_MARKER_CLOSING_YEAR = r"</cds.YR>"
-CFG_REFEXTRACT_MARKER_CLOSING_PAGE = r"</cds.PG>"
-CFG_REFEXTRACT_MARKER_CLOSING_QUOTED = r"</cds.QUOTED>"
-CFG_REFEXTRACT_MARKER_CLOSING_ISBN = r"</cds.ISBN>"
-CFG_REFEXTRACT_MARKER_CLOSING_ISBN = r"</cds.PUBLISHER>"
+## information.
+CFG_REFEXTRACT_MARKER_OPENING_REPORT_NUM = r"<cds.REPORTNUMBER>"
+CFG_REFEXTRACT_MARKER_OPENING_TITLE = r"<cds.JOURNAL>"
+CFG_REFEXTRACT_MARKER_OPENING_TITLE_IBID = r"<cds.JOURNALibid>"
+CFG_REFEXTRACT_MARKER_OPENING_SERIES = r"<cds.SER>"
+CFG_REFEXTRACT_MARKER_OPENING_VOLUME = r"<cds.VOL>"
+CFG_REFEXTRACT_MARKER_OPENING_YEAR = r"<cds.YR>"
+CFG_REFEXTRACT_MARKER_OPENING_PAGE = r"<cds.PG>"
+CFG_REFEXTRACT_MARKER_OPENING_QUOTED = r"<cds.QUOTED>"
+CFG_REFEXTRACT_MARKER_OPENING_ISBN = r"<cds.ISBN>"
+CFG_REFEXTRACT_MARKER_OPENING_PUBLISHER = r"<cds.PUBLISHER>"
+CFG_REFEXTRACT_MARKER_OPENING_COLLABORATION = r"<cds.COLLABORATION>"
+
+# These are the "closing" tags:
+CFG_REFEXTRACT_MARKER_CLOSING_REPORT_NUM = r"</cds.REPORTNUMBER>"
+CFG_REFEXTRACT_MARKER_CLOSING_TITLE = r"</cds.JOURNAL>"
+CFG_REFEXTRACT_MARKER_CLOSING_TITLE_IBID = r"</cds.JOURNALibid>"
+CFG_REFEXTRACT_MARKER_CLOSING_SERIES = r"</cds.SER>"
+CFG_REFEXTRACT_MARKER_CLOSING_VOLUME = r"</cds.VOL>"
+CFG_REFEXTRACT_MARKER_CLOSING_YEAR = r"</cds.YR>"
+CFG_REFEXTRACT_MARKER_CLOSING_PAGE = r"</cds.PG>"
+CFG_REFEXTRACT_MARKER_CLOSING_QUOTED = r"</cds.QUOTED>"
+CFG_REFEXTRACT_MARKER_CLOSING_ISBN = r"</cds.ISBN>"
+CFG_REFEXTRACT_MARKER_CLOSING_PUBLISHER = r"</cds.PUBLISHER>"
+CFG_REFEXTRACT_MARKER_CLOSING_COLLABORATION = r"</cds.COLLABORATION>"
## Of the form '</cds.AUTHxxxx>' only
CFG_REFEXTRACT_MARKER_CLOSING_AUTHOR_STND = r"</cds.AUTHstnd>"
CFG_REFEXTRACT_MARKER_CLOSING_AUTHOR_ETAL = r"</cds.AUTHetal>"
CFG_REFEXTRACT_MARKER_CLOSING_AUTHOR_INCL = r"</cds.AUTHincl>"
-## XML Record and collection opening/closing tags:
-CFG_REFEXTRACT_XML_VERSION = u"""<?xml version="1.0" encoding="UTF-8"?>"""
-CFG_REFEXTRACT_XML_COLLECTION_OPEN = u"""<collection xmlns="http://www.loc.gov/MARC21/slim">"""
-CFG_REFEXTRACT_XML_COLLECTION_CLOSE = u"""</collection>"""
-CFG_REFEXTRACT_XML_RECORD_OPEN = u"<record>"
-CFG_REFEXTRACT_XML_RECORD_CLOSE = u"</record>"
-
## The minimum length of a reference's misc text to be deemed insignificant
## when comparing misc text with semi-colon defined sub-references.
## Values higher than this value reflect meaningful misc text.
## Hence, upon finding a correct semi-colon, but having current misc text
## length less than this value (without other meaningful reference objects:
## report numbers, titles...) then no split will occur.
## (A higher value will increase splitting strictness. i.e. Fewer splits)
CGF_REFEXTRACT_SEMI_COLON_MISC_TEXT_SENSITIVITY = 60
## The length of misc text between two adjacent authors which is
## deemed as insignificant. As such, when misc text of a length less
## than this value is found, then the latter author group is dumped into misc.
## (A higher value will increase splitting strictness. i.e. Fewer splits)
CGF_REFEXTRACT_ADJACENT_AUTH_MISC_SEPARATION = 10
## Maximum number of lines for a citation before it is considered invalid
CFG_REFEXTRACT_MAX_LINES = 25
diff --git a/modules/docextract/lib/refextract_engine.py b/modules/docextract/lib/refextract_engine.py
index 01f917d5d..63d6f50e0 100644
--- a/modules/docextract/lib/refextract_engine.py
+++ b/modules/docextract/lib/refextract_engine.py
@@ -1,1069 +1,1209 @@
# -*- coding: utf-8 -*-
##
## This file is part of Invenio.
## Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010, 2011 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
"""This is the main body of refextract. It is used to extract references from
fulltext PDF documents.
"""
__revision__ = "$Id$"
import re
import os
import subprocess
from itertools import chain
from invenio.refextract_config import \
CFG_REFEXTRACT_MARKER_CLOSING_REPORT_NUM, \
CFG_REFEXTRACT_MARKER_CLOSING_AUTHOR_INCL, \
CFG_REFEXTRACT_MARKER_CLOSING_AUTHOR_STND, \
CFG_REFEXTRACT_MARKER_CLOSING_VOLUME, \
CFG_REFEXTRACT_MARKER_CLOSING_YEAR, \
CFG_REFEXTRACT_MARKER_CLOSING_PAGE, \
CFG_REFEXTRACT_MARKER_CLOSING_TITLE_IBID, \
CFG_REFEXTRACT_MARKER_CLOSING_AUTHOR_ETAL, \
CFG_REFEXTRACT_MARKER_CLOSING_TITLE, \
CFG_REFEXTRACT_MARKER_CLOSING_SERIES
# make refextract runnable without requiring the full Invenio installation:
from invenio.config import CFG_PATH_GFILE
from invenio.refextract_tag import tag_reference_line, \
- sum_2_dictionaries, identify_and_tag_DOI, identify_and_tag_URLs
-from invenio.refextract_xml import create_xml_record, \
- build_xml_citations
+ sum_2_dictionaries, \
+ identify_and_tag_DOI, \
+ identify_and_tag_URLs, \
+ find_numeration, \
+ extract_series_from_volume
+from invenio.refextract_record import build_record, \
+ build_references
from invenio.docextract_pdf import convert_PDF_to_plaintext
from invenio.docextract_utils import write_message
from invenio.refextract_kbs import get_kbs
from invenio.refextract_linker import find_referenced_recid
from invenio.refextract_re import get_reference_line_numeration_marker_patterns, \
regex_match_list, \
re_tagged_citation, \
re_numeration_no_ibid_txt, \
re_roman_numbers, \
re_recognised_numeration_for_title_plus_series
description = """
Refextract tries to extract the reference section from a full-text document.
Extracted reference lines are processed and any recognised citations are
marked up using MARC XML. Recognises author names, URLs, DOIs, and also
journal titles and report numbers as per the relevant knowledge bases. Results
are output to the standard output stream by default, or instead to an xml file.
"""
# General initiation tasks:
# components relating to the standardisation and
# recognition of citations in reference lines:
def remove_reference_line_marker(line):
"""Trim a reference line's 'marker' from the beginning of the line.
@param line: (string) - the reference line.
@return: (tuple) containing two strings:
+ The reference line's marker (or if there was not one,
          a 'space' character).
        + The reference line with its marker removed from the
beginning.
"""
# Get patterns to identify reference-line marker patterns:
marker_patterns = get_reference_line_numeration_marker_patterns()
line = line.lstrip()
marker_match = regex_match_list(line, marker_patterns)
if marker_match is not None:
# found a marker:
marker_val = marker_match.group(u'mark')
# trim the marker from the start of the line:
line = line[marker_match.end():].lstrip()
else:
marker_val = u" "
return (marker_val, line)
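# Illustrative example (editorial sketch, not part of the patch): given
# the line u"[1] S. Weinberg, Phys. Rev. 112 (1958) 1375.", and assuming
# the bracketed-number style is among the recognised marker patterns,
# this returns:
#     (u"[1]", u"S. Weinberg, Phys. Rev. 112 (1958) 1375.")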
def roman2arabic(num):
"""Convert numbers from roman to arabic
This function expects a string like XXII
and outputs an integer
"""
t = 0
p = 0
for r in num:
n = 10 ** (205558 % ord(r) % 7) % 9995
t += n - 2 * p % n
p = n
return t
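# Editorial note on the arithmetic above: 10 ** (205558 % ord(r) % 7) % 9995
# is a compact lookup mapping each roman digit to its value
# (I->1, V->5, X->10, L->50, C->100, D->500, M->1000), while
# "n - 2 * p % n" subtracts twice the previous digit when it was smaller,
# which handles subtractive notation such as IV and IX. For example:
#     roman2arabic(u"XXII")    -> 22
#     roman2arabic(u"CXXII")   -> 122
#     roman2arabic(u"MCMXCIX") -> 1999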
## Transformations
def format_volume(citation_elements):
"""format volume number (roman numbers to arabic)
When the volume number is expressed in roman numerals (CXXII),
it is converted to its equivalent in arabic numerals (122)
"""
re_roman = re.compile(re_roman_numbers + u'$', re.UNICODE)
for el in citation_elements:
if el['type'] == 'JOURNAL' and re_roman.match(el['volume']):
el['volume'] = str(roman2arabic(el['volume'].upper()))
return citation_elements
def handle_special_journals(citation_elements, kbs):
"""format special journals (like JHEP) volume number
JHEP needs the volume number prefixed with the year
e.g. JHEP 0301 instead of JHEP 01
"""
for el in citation_elements:
if el['type'] == 'JOURNAL' and el['title'] in kbs['special_journals'] \
and re.match('\d{1,2}$', el['volume']):
# Sometimes the page is omitted and the year is written in its place
# We can never be sure but it's very likely that page > 1900 is
# actually a year, so we demote this reference to MISC and skip it
if el['year'] == '' and re.match('(19|20)\d{2}$', el['page']):
el['type'] = 'MISC'
el['misc_txt'] = "%s,%s,%s" \
% (el['title'], el['volume'], el['page'])
continue
el['volume'] = el['year'][-2:] + '%02d' % int(el['volume'])
return citation_elements
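# Worked example (editorial): a JHEP reference tagged with volume u"03",
# year u"2003" and an ordinary page number is rewritten to volume u"0303",
# i.e. the last two digits of the year prefixed to the zero-padded volume.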
def format_report_number(citation_elements):
"""Format report numbers that are missing a dash
e.g. CERN-LHCC2003-01 to CERN-LHCC-2003-01
"""
re_report = re.compile(ur'^(?P<name>[A-Z-]+)(?P<nums>[\d-]+)$', re.UNICODE)
for el in citation_elements:
if el['type'] == 'REPORTNUMBER':
m = re_report.match(el['report_num'])
if m:
name = m.group('name')
if not name.endswith('-'):
el['report_num'] = m.group('name') + '-' + m.group('nums')
return citation_elements
def format_hep(citation_elements):
"""Format hep-th report numbers with a dash
e.g. replaces hep-th-9711200 with hep-th/9711200
"""
prefixes = ('astro-ph-', 'hep-th-', 'hep-ph-', 'hep-ex-', 'hep-lat-',
'math-ph-')
for el in citation_elements:
if el['type'] == 'REPORTNUMBER':
for p in prefixes:
if el['report_num'].startswith(p):
el['report_num'] = el['report_num'][:len(p) - 1] + '/' + \
el['report_num'][len(p):]
return citation_elements
def format_author_ed(citation_elements):
"""Standardise to (ed.) and (eds.)
e.g. Remove extra space in (ed. )
"""
for el in citation_elements:
if el['type'] == 'AUTH':
el['auth_txt'] = el['auth_txt'].replace('(ed. )', '(ed.)')
el['auth_txt'] = el['auth_txt'].replace('(eds. )', '(eds.)')
return citation_elements
def look_for_books(citation_elements, kbs):
"""Look for books in our kb
Create book tags by using the authors and the title to find books
in our knowledge base
"""
authors = None
title = None
for el in citation_elements:
if el['type'] == 'AUTH':
authors = el
break
for el in citation_elements:
if el['type'] == 'QUOTED':
title = el
break
if authors and title:
if title['title'].upper() in kbs['books']:
line = kbs['books'][title['title'].upper()]
el = {'type': 'BOOK',
'misc_txt': '',
'authors': line[0],
'title': line[1],
'year': line[2].strip(';')}
citation_elements.append(el)
citation_elements.remove(title)
return citation_elements
def split_volume_from_journal(citation_elements):
"""Split volume from journal title
We need this because sometimes the series letter is attached to the
journal title instead of the volume. In those cases we move it from
the title to the volume
"""
for el in citation_elements:
if el['type'] == 'JOURNAL' and ';' in el['title']:
el['title'], series = el['title'].rsplit(';', 1)
el['volume'] = series + el['volume']
return citation_elements
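# Worked example (editorial): an element with title u"Phys.Lett.;B" and
# volume u"123" becomes title u"Phys.Lett." and volume u"B123", since the
# text after the last semicolon is treated as the series letter.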
def remove_b_for_nucl_phys(citation_elements):
"""Removes b from the volume of some journals
Removes the B from the volume for Nucl.Phys.Proc.Suppl. because in INSPIRE
that journal is handled differently.
"""
for el in citation_elements:
if el['type'] == 'JOURNAL' and el['title'] == 'Nucl.Phys.Proc.Suppl.' \
and 'volume' in el \
and (el['volume'].startswith('b') or el['volume'].startswith('B')):
el['volume'] = el['volume'][1:]
return citation_elements
def mangle_volume(citation_elements):
"""Make sure the volume letter is before the volume number
e.g. transforms 100B to B100
"""
volume_re = re.compile(ur"(\d+)([A-Z])", re.U|re.I)
for el in citation_elements:
if el['type'] == 'JOURNAL':
matches = volume_re.match(el['volume'])
if matches:
el['volume'] = matches.group(2) + matches.group(1)
return citation_elements
def balance_authors(splitted_citations, new_elements):
if not splitted_citations:
return
last_citation = splitted_citations[-1]
current_citation = new_elements
if last_citation[-1]['type'] == 'AUTH' \
and sum([1 for cit in last_citation if cit['type'] == 'AUTH']) > 1:
el = last_citation.pop()
current_citation.insert(0, el)
+def associate_recids(citation_elements):
+    """Attach to each citation element the recid of the matching record,
+    or None when no match is found"""
+    for el in citation_elements:
+ try:
+ el['recid'] = find_referenced_recid(el).pop()
+ except (IndexError, KeyError):
+ el['recid'] = None
+ return citation_elements
+
+
+def associate_recids_catchup(splitted_citations):
+ for citation_elements in splitted_citations:
+ associate_recids(citation_elements)
+
+
def split_citations(citation_elements):
"""Split a citation line in multiple citations
We handle the case where the author has put 2 citations in the same line
but split with ; or some other method.
"""
splitted_citations = []
new_elements = []
current_recid = None
+ current_doi = None
def check_ibid(current_elements, trigger_el):
+        # If the current citation already contains an author, there is
+        # no need to recover one from a previous citation
+        for el in current_elements:
+ if el['type'] == 'AUTH':
+ return
+
# Check for ibid
if trigger_el.get('is_ibid', False):
if splitted_citations:
els = chain(reversed(current_elements),
reversed(splitted_citations[-1]))
else:
els = reversed(current_elements)
for el in els:
if el['type'] == 'AUTH':
new_elements.append(el.copy())
break
def start_new_citation():
"""Start new citation"""
splitted_citations.append(new_elements[:])
del new_elements[:]
- to_merge = None
for el in citation_elements:
- if to_merge:
- el['misc_txt'] = to_merge + " " + el.get('misc_txt', '')
- to_merge = None
-
try:
- el_recid = find_referenced_recid(el).pop()
- except (IndexError, KeyError):
+ el_recid = el['recid']
+        except KeyError:
el_recid = None
if current_recid and el_recid and current_recid == el_recid:
# Do not start a new citation
pass
- elif current_recid and el_recid and current_recid != el_recid:
+ elif current_recid and el_recid and current_recid != el_recid \
+ or current_doi and el['type'] == 'DOI' and \
+ current_doi != el['doi_string']:
start_new_citation()
# Some authors may be found in the previous citation
balance_authors(splitted_citations, new_elements)
- elif ';' in el['misc_txt'] and valid_citation(new_elements):
- el['misc_txt'], to_merge = el['misc_txt'].rsplit(';', 1)
+ elif ';' in el['misc_txt']:
+ misc_txt, el['misc_txt'] = el['misc_txt'].split(';', 1)
+ if misc_txt:
+ new_elements.append({'type': 'MISC',
+ 'misc_txt': misc_txt})
start_new_citation()
+ while ';' in el['misc_txt']:
+ misc_txt, el['misc_txt'] = el['misc_txt'].split(';', 1)
+ if misc_txt:
+ new_elements.append({'type': 'MISC',
+ 'misc_txt': misc_txt})
+ start_new_citation()
if el_recid:
current_recid = el_recid
+ if el['type'] == 'DOI':
+ current_doi = el['doi_string']
+
check_ibid(new_elements, el)
new_elements.append(el)
- if to_merge:
- new_elements[-1]['misc_txt'] += " " + to_merge
- new_elements[-1]['misc_txt'] = new_elements[-1]['misc_txt'].strip()
-
splitted_citations.append(new_elements)
- return splitted_citations
+    return [cit for cit in splitted_citations if not empty_citation(cit)]
+
+
+def empty_citation(citation):
+    """Check if a citation is empty, i.e. contains only MISC elements
+    with no text"""
+    els_to_remove = ('MISC', )
+ for el in citation:
+ if el['type'] not in els_to_remove:
+ return False
+ if el['misc_txt']:
+ return False
+ return True
def valid_citation(citation):
els_to_remove = ('MISC', )
for el in citation:
if el['type'] not in els_to_remove:
return True
return False
def remove_invalid_references(splitted_citations):
def add_misc(el, txt):
if not el.get('misc_txt'):
el['misc_txt'] = txt
else:
el['misc_txt'] += " " + txt
splitted_citations = [citation for citation in splitted_citations \
if citation]
# We merge some elements in here which means it only makes sense when
# we have at least 2 elements to merge together
if len(splitted_citations) > 1:
previous_citation = None
for citation in splitted_citations:
if not valid_citation(citation):
# Merge to previous one misc txt
if previous_citation:
citation_to_merge_into = previous_citation
else:
citation_to_merge_into = splitted_citations[1]
for el in citation:
add_misc(citation_to_merge_into[-1], el['misc_txt'])
previous_citation = citation
return [citation for citation in splitted_citations \
if valid_citation(citation)]
+def merge_invalid_references(splitted_citations):
+ def add_misc(el, txt):
+ if not el.get('misc_txt'):
+ el['misc_txt'] = txt
+ else:
+ el['misc_txt'] += " " + txt
+
+ splitted_citations = [citation for citation in splitted_citations \
+ if citation]
+
+ # We merge some elements in here which means it only makes sense when
+ # we have at least 2 elements to merge together
+ if len(splitted_citations) > 1:
+ previous_citation = None
+ previous_citation_valid = True
+ for citation in splitted_citations:
+ current_citation_valid = valid_citation(citation)
+ if not current_citation_valid:
+ # Merge to previous one misc txt
+ if not previous_citation_valid and not current_citation_valid:
+ for el in citation:
+ add_misc(previous_citation[-1], el['misc_txt'])
+
+ previous_citation = citation
+ previous_citation_valid = current_citation_valid
+
+ return [citation for citation in splitted_citations \
+ if valid_citation(citation)]
+
+
def add_year_elements(splitted_citations):
for citation in splitted_citations:
# Skip citations that already contain a year element
has_year = False
for el in citation:
if el['type'] == 'YEAR':
has_year = True
break
if has_year:
continue
year = None
for el in citation:
if (el['type'] == 'JOURNAL' or el['type'] == 'BOOK') \
and 'year' in el:
year = el['year']
break
if year:
citation.append({'type': 'YEAR',
'year': year,
'misc_txt': '',
})
return splitted_citations
+def look_for_implied_ibids(splitted_citations):
+    """Find implied ibids: a citation without a journal that carries only
+    new numeration in its misc text reuses the previous citation's journal"""
+    def look_for_journal(els):
+ for el in els:
+ if el['type'] == 'JOURNAL':
+ return True
+ return False
+
+ current_journal = None
+ for citation in splitted_citations:
+ if current_journal and not look_for_journal(citation):
+ for el in citation:
+ if el['type'] == 'MISC':
+ numeration = find_numeration(el['misc_txt'])
+ if numeration:
+ if not numeration['series']:
+ numeration['series'] = extract_series_from_volume(current_journal['volume'])
+ if numeration['series']:
+ volume = numeration['series'] + numeration['volume']
+ else:
+ volume = numeration['volume']
+ ibid_el = {'type' : 'JOURNAL',
+ 'misc_txt' : '',
+ 'title' : current_journal['title'],
+ 'volume' : volume,
+ 'year' : numeration['year'],
+ 'page' : numeration['page'],
+ 'is_ibid' : True,
+ 'extra_ibids': []}
+ citation.append(ibid_el)
+ el['misc_txt'] = el['misc_txt'][numeration['len']:]
+
+ current_journal = None
+ for el in citation:
+ if el['type'] == 'JOURNAL':
+ current_journal = el
+
+ return splitted_citations
+
+
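+# Editorial sketch of the transformation above: given a previous citation
+# containing the journal "Phys.Rev." with volume "D50", a following
+# citation whose only content is the misc text "D51 (1995) 456" gains an
+# implied JOURNAL ibid titled "Phys.Rev." with volume "D51", year "1995"
+# and page "456", assuming find_numeration() recognises that numeration.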
+def remove_duplicated_authors(splitted_citations):
+    """Keep only the first author group of each citation, demoting any
+    further AUTH elements to misc text"""
+    for citation in splitted_citations:
+ found_author = False
+ for el in citation:
+ if el['type'] == 'AUTH':
+ if found_author:
+ el['type'] = 'MISC'
+ el['misc_txt'] = el['misc_txt'] + " " + el['auth_txt']
+ else:
+ found_author = True
+
+ return splitted_citations
+
+
+def remove_duplicated_dois(splitted_citations):
+ for citation in splitted_citations:
+ found_doi = False
+ for el in citation[:]:
+ if el['type'] == 'DOI':
+ if found_doi:
+ citation.remove(el)
+ else:
+ found_doi = True
+
+ return splitted_citations
+
+
+def add_recid_elements(splitted_citations):
+ for citation in splitted_citations:
+ for el in citation:
+ if el.get('recid', None):
+ citation.append({'type': 'RECID',
+ 'recid': el['recid'],
+ 'misc_txt': ''})
+ break
+
+
## End of elements transformations
def print_citations(splitted_citations, line_marker):
write_message('* splitted_citations', verbose=9)
write_message(' * line marker %s' % line_marker, verbose=9)
for citation in splitted_citations:
write_message(" * elements", verbose=9)
for el in citation:
write_message(' * %s %s' % (el['type'], repr(el)), verbose=9)
def parse_reference_line(ref_line, kbs, bad_titles_count={}):
"""Parse one reference line
@param ref_line: (string) a single reference bullet
@return: (tuple) of (splitted_citations, line_marker, counts,
bad_titles_count)
"""
# Strip the 'marker' (e.g. [1]) from this reference line:
- (line_marker, ref_line) = remove_reference_line_marker(ref_line)
+ line_marker, ref_line = remove_reference_line_marker(ref_line)
# Find DOI sections in citation
- (ref_line, identified_dois) = identify_and_tag_DOI(ref_line)
+ ref_line, identified_dois = identify_and_tag_DOI(ref_line)
# Identify and replace URLs in the line:
- (ref_line, identified_urls) = identify_and_tag_URLs(ref_line)
+ ref_line, identified_urls = identify_and_tag_URLs(ref_line)
# Tag <cds.JOURNAL>, etc.
tagged_line, bad_titles_count = tag_reference_line(ref_line,
kbs,
bad_titles_count)
# Debug print tagging (authors, titles, volumes, etc.)
write_message('* tags %r' % tagged_line, verbose=9)
# Using the recorded information, create a MARC XML representation
# of the rebuilt line:
# At the same time, get stats of citations found in the reference line
# (titles, urls, etc):
citation_elements, line_marker, counts = \
parse_tagged_reference_line(line_marker,
tagged_line,
identified_dois,
identified_urls)
# Transformations on elements
- citation_elements = split_volume_from_journal(citation_elements)
- citation_elements = format_volume(citation_elements)
- citation_elements = handle_special_journals(citation_elements, kbs)
- citation_elements = format_report_number(citation_elements)
- citation_elements = format_author_ed(citation_elements)
- citation_elements = look_for_books(citation_elements, kbs)
- citation_elements = format_hep(citation_elements)
- citation_elements = remove_b_for_nucl_phys(citation_elements)
- citation_elements = mangle_volume(citation_elements)
+ split_volume_from_journal(citation_elements)
+ format_volume(citation_elements)
+ handle_special_journals(citation_elements, kbs)
+ format_report_number(citation_elements)
+ format_author_ed(citation_elements)
+ look_for_books(citation_elements, kbs)
+ format_hep(citation_elements)
+ remove_b_for_nucl_phys(citation_elements)
+ mangle_volume(citation_elements)
+ associate_recids(citation_elements)
# Split the reference in multiple ones if needed
splitted_citations = split_citations(citation_elements)
-
+ # Look for implied ibids
+ look_for_implied_ibids(splitted_citations)
+ # Associate recids to the newly added ibids
+ associate_recids_catchup(splitted_citations)
# Remove references with only misc text
- splitted_citations = remove_invalid_references(splitted_citations)
+ # splitted_citations = remove_invalid_references(splitted_citations)
+ # Merge references with only misc text
+ # splitted_citations = merge_invalid_references(splitted_citations)
# Find year
- splitted_citations = add_year_elements(splitted_citations)
+ add_year_elements(splitted_citations)
+ # Remove duplicate authors
+ remove_duplicated_authors(splitted_citations)
+ # Remove duplicate DOIs
+ remove_duplicated_dois(splitted_citations)
+ # Add recid elements
+ add_recid_elements(splitted_citations)
# For debugging purposes
print_citations(splitted_citations, line_marker)
return splitted_citations, line_marker, counts, bad_titles_count
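# Editorial usage sketch (kbs as returned by get_kbs(); the reference
# string is hypothetical):
#     citations, marker, counts, bad_titles = parse_reference_line(
#         u"[1] J. Maldacena, Adv.Theor.Math.Phys. 2 (1998) 231", kbs)
# would yield a single citation whose elements include a JOURNAL entry
# and, when a matching record exists, a RECID element.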
def parse_references_elements(ref_sect, kbs):
"""Passed a complete reference section, process each line and attempt to
## identify and standardise individual citations within the line.
@param ref_sect: (list) of strings - each string in the list is a
reference line.
@param preprint_repnum_search_kb: (dictionary) - keyed by a tuple
containing the line-number of the pattern in the KB and the non-standard
category string. E.g.: (3, 'ASTRO PH'). Value is regexp pattern used to
search for that report-number.
@param preprint_repnum_standardised_categs: (dictionary) - keyed by non-
standard version of institutional report number, value is the
standardised version of that report number.
@param periodical_title_search_kb: (dictionary) - keyed by non-standard
title to search for, value is the compiled regexp pattern used to
search for that title.
@param standardised_periodical_titles: (dictionary) - keyed by non-
standard title to search for, value is the standardised version of that
title.
@param periodical_title_search_keys: (list) - ordered list of non-
standard titles to search for.
@return: (tuple) of 6 components:
( list -> of strings, each string is a MARC XML-ized reference
line.
integer -> number of fields of miscellaneous text found for the
record.
integer -> number of title citations found for the record.
integer -> number of institutional report-number citations found
for the record.
integer -> number of URL citations found for the record.
integer -> number of DOI's found
integer -> number of author groups found
dictionary -> The totals for each 'bad title' found in the reference
section.
)
"""
# a list to contain the processed reference lines:
citations = []
# counters for extraction stats:
counts = {
'misc': 0,
'title': 0,
'reportnum': 0,
'url': 0,
'doi': 0,
'auth_group': 0,
}
# A dictionary to contain the total count of each 'bad title' found
# in the entire reference section:
bad_titles_count = {}
# process references line-by-line:
for ref_line in ref_sect:
citation_elements, line_marker, this_counts, bad_titles_count = \
parse_reference_line(ref_line, kbs, bad_titles_count)
# Accumulate stats
counts = sum_2_dictionaries(counts, this_counts)
citations.append({'elements' : citation_elements,
'line_marker': line_marker})
# Return the list of processed reference lines:
return citations, counts, bad_titles_count
def parse_tagged_reference_line(line_marker,
line,
identified_dois,
identified_urls):
""" Given a single tagged reference line, convert it to its MARC-XML representation.
Try to find all tags and extract their contents and their types into corresponding
dictionary elements. Append each dictionary tag representation onto a list, which
is given to 'build_formatted_xml_citation()' where the correct xml output will be generated.
This method is dumb, with very few heuristics. It simply looks for tags, and makes dictionaries
from the data it finds in a tagged reference line.
@param line_marker: (string) The line marker for this single reference line (e.g. [19])
@param line: (string) The tagged reference line.
@param identified_dois: (list) a list of dois which were found in this line. The ordering of
dois corresponds to the ordering of tags in the line, reading from left to right.
@param identified_urls: (list) a list of urls which were found in this line. The ordering of
urls corresponds to the ordering of tags in the line, reading from left to right.
@param which format to use for references,
roughly "<title> <volume> <page>" or "<title>,<volume>,<page>"
@return xml_line: (string) the MARC-XML representation of the tagged reference line
@return count_*: (integer) the number of * (pieces of info) found in the reference line.
"""
count_misc = count_title = count_reportnum = count_url = count_doi = count_auth_group = 0
processed_line = line
cur_misc_txt = u""
tag_match = re_tagged_citation.search(processed_line)
# contains a list of dictionary entries of previously cited items
citation_elements = []
# the last tag element found when working from left-to-right across the line
identified_citation_element = None
while tag_match is not None:
# While there are tags inside this reference line...
tag_match_start = tag_match.start()
tag_match_end = tag_match.end()
tag_type = tag_match.group(1)
cur_misc_txt += processed_line[0:tag_match_start]
# Catches both standard titles, and ibid's
if tag_type.find("JOURNAL") != -1:
# This tag is an identified journal TITLE. It should be followed
# by VOLUME, YEAR and PAGE tags.
# See if the found title has been tagged as an ibid: <cds.JOURNALibid>
if tag_match.group('ibid'):
is_ibid = True
closing_tag_length = len(CFG_REFEXTRACT_MARKER_CLOSING_TITLE_IBID)
idx_closing_tag = processed_line.find(CFG_REFEXTRACT_MARKER_CLOSING_TITLE_IBID,
tag_match_end)
else:
is_ibid = False
closing_tag_length = len(CFG_REFEXTRACT_MARKER_CLOSING_TITLE)
# extract the title from the line:
idx_closing_tag = processed_line.find(CFG_REFEXTRACT_MARKER_CLOSING_TITLE,
tag_match_end)
if idx_closing_tag == -1:
# no closing TITLE tag found - get rid of the solitary tag
processed_line = processed_line[tag_match_end:]
identified_citation_element = None
else:
# Closing tag was found:
# The title text to be used in the marked-up citation:
title_text = processed_line[tag_match_end:idx_closing_tag]
# Now trim this matched title and its tags from the start of the line:
processed_line = processed_line[idx_closing_tag+closing_tag_length:]
numeration_match = re_recognised_numeration_for_title_plus_series.search(processed_line)
if numeration_match:
# recognised numeration immediately after the title - extract it:
reference_volume = numeration_match.group('vol')
reference_year = numeration_match.group('yr') or ''
reference_page = numeration_match.group('pg')
# This is used on two accounts:
# 1. To get the series char from the title, if no series was found with the numeration
# 2. To always remove any series character from the title match text
# series_from_title = re_series_from_title.search(title_text)
#
if numeration_match.group('series'):
reference_volume = numeration_match.group('series') + reference_volume
# Skip past the matched numeration in the working line:
processed_line = processed_line[numeration_match.end():]
# 'is_ibid' saves whether THIS TITLE is an ibid or not. (True or False)
# 'extra_ibids' are there to hold ibids without the word 'ibid', which
# come directly after this title
# i.e., they are recognised using title numeration instead of ibid notation
identified_citation_element = {'type' : "JOURNAL",
'misc_txt' : cur_misc_txt,
'title' : title_text,
'volume' : reference_volume,
'year' : reference_year,
'page' : reference_page,
'is_ibid' : is_ibid,
'extra_ibids': []
}
count_title += 1
cur_misc_txt = u""
# Try to find IBIDs after this title, on top of previously found titles that were
# denoted with the word 'IBID'. (i.e. look for IBIDs without the word 'IBID' by
# looking at extra numeration after this title)
numeration_match = re_numeration_no_ibid_txt.match(processed_line)
while numeration_match is not None:
reference_volume = numeration_match.group('vol')
reference_year = numeration_match.group('yr')
reference_page = numeration_match.group('pg')
if numeration_match.group('series'):
reference_volume = numeration_match.group('series') + reference_volume
# Skip past the matched numeration in the working line:
processed_line = processed_line[numeration_match.end():]
# Takes the just found title text
identified_citation_element['extra_ibids'].append(
{'type' : "JOURNAL",
'misc_txt' : "",
'title' : title_text,
'volume' : reference_volume,
'year' : reference_year,
'page' : reference_page,
})
# Increment the stats counters:
count_title += 1
title_text = ""
reference_volume = ""
reference_year = ""
reference_page = ""
numeration_match = re_numeration_no_ibid_txt.match(processed_line)
else:
# No numeration was recognised after the title. Add the title into a MISC item instead:
cur_misc_txt += "%s" % title_text
identified_citation_element = None
elif tag_type == "REPORTNUMBER":
# This tag is an identified institutional report number:
# extract the institutional report-number from the line:
idx_closing_tag = processed_line.find(CFG_REFEXTRACT_MARKER_CLOSING_REPORT_NUM,
tag_match_end)
# Sanity check - did we find a closing report-number tag?
if idx_closing_tag == -1:
# no closing </cds.REPORTNUMBER> tag found - strip the opening tag and move past this
# recognised reportnumber as it is unreliable:
processed_line = processed_line[tag_match_end:]
identified_citation_element = None
else:
# closing tag was found
report_num = processed_line[tag_match_end:idx_closing_tag]
# now trim this matched institutional report-number
# and its tags from the start of the line:
ending_tag_pos = idx_closing_tag \
+ len(CFG_REFEXTRACT_MARKER_CLOSING_REPORT_NUM)
processed_line = processed_line[ending_tag_pos:]
identified_citation_element = {'type' : "REPORTNUMBER",
'misc_txt' : cur_misc_txt,
'report_num' : report_num}
count_reportnum += 1
cur_misc_txt = u""
elif tag_type == "URL":
# This tag is an identified URL:
# From the "identified_urls" list, get this URL and its
# description string:
url_string = identified_urls[0][0]
url_desc = identified_urls[0][1]
# Now move past this "<cds.URL />" tag in the line:
processed_line = processed_line[tag_match_end:]
# Delete the information for this URL from the start of the list
# of identified URLs:
identified_urls[0:1] = []
# Save the current misc text
identified_citation_element = {
'type' : "URL",
'misc_txt' : "%s" % cur_misc_txt,
'url_string' : "%s" % url_string,
'url_desc' : "%s" % url_desc
}
count_url += 1
cur_misc_txt = u""
elif tag_type == "DOI":
# This tag is an identified DOI:
# From the "identified_dois" list, get this DOI and its
# description string:
doi_string = identified_dois[0]
# Now move past this "<cds.DOI />" tag in the line:
processed_line = processed_line[tag_match_end:]
# Remove DOI from the list of DOI strings
identified_dois[0:1] = []
# SAVE the current misc text
identified_citation_element = {
'type' : "DOI",
'misc_txt' : "%s" % cur_misc_txt,
'doi_string' : "%s" % doi_string
}
# Increment the stats counters:
count_doi += 1
cur_misc_txt = u""
elif tag_type.find("AUTH") != -1:
# This tag is an identified Author:
auth_type = ""
# extract the title from the line:
if tag_type.find("stnd") != -1:
auth_type = "stnd"
idx_closing_tag_nearest = processed_line.find(
CFG_REFEXTRACT_MARKER_CLOSING_AUTHOR_STND, tag_match_end)
elif tag_type.find("etal") != -1:
auth_type = "etal"
idx_closing_tag_nearest = processed_line.find(
CFG_REFEXTRACT_MARKER_CLOSING_AUTHOR_ETAL, tag_match_end)
elif tag_type.find("incl") != -1:
auth_type = "incl"
idx_closing_tag_nearest = processed_line.find(
CFG_REFEXTRACT_MARKER_CLOSING_AUTHOR_INCL, tag_match_end)
if idx_closing_tag_nearest == -1:
# no closing </cds.AUTH****> tag found - strip the opening tag
# and move past it
processed_line = processed_line[tag_match_end:]
identified_citation_element = None
else:
auth_txt = processed_line[tag_match_end:idx_closing_tag_nearest]
# Now move past the ending tag in the line (all closing AUTH tags
# have the same length, so a placeholder is used for len()):
processed_line = processed_line[idx_closing_tag_nearest + len("</cds.AUTHxxxx>"):]
#SAVE the current misc text
identified_citation_element = {
'type' : "AUTH",
'misc_txt' : "%s" % cur_misc_txt,
'auth_txt' : "%s" % auth_txt,
'auth_type' : "%s" % auth_type
}
# Increment the stats counters:
count_auth_group += 1
cur_misc_txt = u""
# These following tags may be found separately;
# They are usually found when a "JOURNAL" tag is hit
# (ONLY immediately afterwards, however)
# Sitting by themselves means they do not have
# an associated TITLE tag, and should be MISC
elif tag_type == "SER":
# This tag is a SERIES tag; Since it was not preceded by a TITLE
# tag, it is useless - strip the tag and put it into miscellaneous:
(cur_misc_txt, processed_line) = \
convert_unusable_tag_to_misc(processed_line, cur_misc_txt,
tag_match_end,
CFG_REFEXTRACT_MARKER_CLOSING_SERIES)
identified_citation_element = None
elif tag_type == "VOL":
# This tag is a VOLUME tag; Since it was not preceded by a TITLE
# tag, it is useless - strip the tag and put it into miscellaneous:
(cur_misc_txt, processed_line) = \
convert_unusable_tag_to_misc(processed_line, cur_misc_txt,
tag_match_end,
CFG_REFEXTRACT_MARKER_CLOSING_VOLUME)
identified_citation_element = None
elif tag_type == "YR":
# This tag is a YEAR tag; Since it's not preceded by TITLE and
# VOLUME tags, it is useless - strip the tag and put the contents
# into miscellaneous:
(cur_misc_txt, processed_line) = \
convert_unusable_tag_to_misc(processed_line, cur_misc_txt,
tag_match_end,
CFG_REFEXTRACT_MARKER_CLOSING_YEAR)
identified_citation_element = None
elif tag_type == "PG":
# This tag is a PAGE tag; Since it's not preceded by TITLE,
# VOLUME and YEAR tags, it is useless - strip the tag and put the
# contents into miscellaneous:
(cur_misc_txt, processed_line) = \
convert_unusable_tag_to_misc(processed_line, cur_misc_txt,
tag_match_end,
CFG_REFEXTRACT_MARKER_CLOSING_PAGE)
identified_citation_element = None
elif tag_type == "QUOTED":
identified_citation_element, processed_line, cur_misc_txt = \
map_tag_to_subfield(tag_type,
processed_line[tag_match_end:],
cur_misc_txt,
'title')
elif tag_type == "ISBN":
identified_citation_element, processed_line, cur_misc_txt = \
map_tag_to_subfield(tag_type,
processed_line[tag_match_end:],
cur_misc_txt,
tag_type)
elif tag_type == "PUBLISHER":
identified_citation_element, processed_line, cur_misc_txt = \
map_tag_to_subfield(tag_type,
processed_line[tag_match_end:],
cur_misc_txt,
'publisher')
+ elif tag_type == "COLLABORATION":
+ identified_citation_element, processed_line, cur_misc_txt = \
+ map_tag_to_subfield(tag_type,
+ processed_line[tag_match_end:],
+ cur_misc_txt,
+ 'collaboration')
+
if identified_citation_element:
# Append the found tagged data and current misc text
citation_elements.append(identified_citation_element)
identified_citation_element = None
# Look for the next tag in the processed line:
tag_match = re_tagged_citation.search(processed_line)
# place any remaining miscellaneous text into the
# appropriate MARC XML fields:
cur_misc_txt += processed_line
# This MISC element will hold the entire citation in the event
# that no tags were found.
if len(cur_misc_txt.strip(" .;,")) > 0:
# Increment the stats counters:
count_misc += 1
identified_citation_element = {
'type' : "MISC",
'misc_txt' : "%s" % cur_misc_txt,
}
citation_elements.append(identified_citation_element)
return (citation_elements, line_marker, {
'misc': count_misc,
'title': count_title,
'reportnum': count_reportnum,
'url': count_url,
'doi': count_doi,
'auth_group': count_auth_group
})
def map_tag_to_subfield(tag_type, line, cur_misc_txt, dest):
"""Create a new reference element"""
closing_tag = '</cds.%s>' % tag_type
# extract the tagged content from the line:
idx_closing_tag = line.find(closing_tag)
# Sanity check - did we find a closing tag?
if idx_closing_tag == -1:
# no closing </cds.TAG> tag found - strip the opening tag and move past this
# recognised item as it is unreliable:
identified_citation_element = None
line = line[len('<cds.%s>' % tag_type):]
else:
tag_content = line[:idx_closing_tag]
identified_citation_element = {'type' : tag_type,
'misc_txt' : cur_misc_txt,
dest : tag_content}
ending_tag_pos = idx_closing_tag + len(closing_tag)
line = line[ending_tag_pos:]
cur_misc_txt = u""
return identified_citation_element, line, cur_misc_txt
def convert_unusable_tag_to_misc(line,
misc_text,
tag_match_end,
closing_tag):
"""Function to remove an unwanted, tagged, citation item from a reference
line. The tagged item itself is put into the miscellaneous text variable;
the data up to the closing tag is then trimmed from the beginning of the
working line. For example, the following working line:
Example, AN. Testing software; <cds.YR>(2001)</cds.YR>, CERN, Geneva.
...would be trimmed down to:
, CERN, Geneva.
...And the Miscellaneous text taken from the start of the line would be:
Example, AN. Testing software; (2001)
...(assuming that the details of <cds.YR> and </cds.YR> were passed to
the function).
@param line: (string) - the reference line.
@param misc_text: (string) - the variable containing the miscellaneous
text recorded so far.
@param tag_match_end: (integer) - the index of the end of the opening tag
in the line.
@param closing_tag: (string) - the closing tag to look for in the line
(e.g. </cds.YR>).
@return: (tuple) - containing misc_text (string) and line (string)
"""
# extract the tagged information:
idx_closing_tag = line.find(closing_tag, tag_match_end)
# Sanity check - did we find a closing tag?
if idx_closing_tag == -1:
# no closing tag found - strip the opening tag and move past this
# recognised item as it is unusable:
line = line[tag_match_end:]
else:
# closing tag was found
misc_text += line[tag_match_end:idx_closing_tag]
# now trim the matched item and its tags from the start of the line:
line = line[idx_closing_tag+len(closing_tag):]
return (misc_text, line)
# Tasks related to extraction of reference section from full-text:
# ----> 1. Removing page-breaks, headers and footers before
# searching for reference section:
# ----> 2. Finding reference section in full-text:
# ----> 3. Found reference section - now take out lines and rebuild them:
def remove_leading_garbage_lines_from_reference_section(ref_sectn):
"""Sometimes, the first lines of the extracted references are completely
blank or email addresses. These must be removed as they are not
references.
@param ref_sectn: (list) of strings - the reference section lines
@return: (list) of strings - the reference section without leading
blank lines or email addresses.
"""
p_email = re.compile(ur'^\s*e\-?mail', re.UNICODE)
while ref_sectn and (ref_sectn[0].isspace() or p_email.match(ref_sectn[0])):
ref_sectn.pop(0)
return ref_sectn
# ----> Glue - logic for finding and extracting reference section:
# Tasks related to conversion of full-text to plain-text:
def get_plaintext_document_body(fpath, keep_layout=False):
"""Given a file-path to a full-text, return a list of unicode strings
whereby each string is a line of the fulltext.
In the case of a plain-text document, this simply means reading the
contents in from the file. In the case of a PDF/PostScript however,
this means converting the document to plaintext.
@param fpath: (string) - the path to the fulltext file
@return: (list) of strings - each string being a line in the document.
"""
textbody = []
status = 0
if os.access(fpath, os.F_OK|os.R_OK):
# filepath OK - attempt to extract references:
# get file type:
cmd_pdftotext = [CFG_PATH_GFILE, fpath]
pipe_pdftotext = subprocess.Popen(cmd_pdftotext, stdout=subprocess.PIPE)
res_gfile = pipe_pdftotext.stdout.read()
if (res_gfile.lower().find("text") != -1) and \
(res_gfile.lower().find("pdf") == -1):
# plain-text file: don't convert - just read in:
f = open(fpath, "r")
try:
textbody = [line.decode("utf-8") for line in f.readlines()]
finally:
f.close()
elif (res_gfile.lower().find("pdf") != -1) or \
(res_gfile.lower().find("pdfa") != -1):
# convert from PDF
(textbody, status) = convert_PDF_to_plaintext(fpath, keep_layout)
else:
# invalid format
status = 1
else:
# filepath not OK
status = 1
return (textbody, status)
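# Editorial usage sketch (the path is hypothetical):
#     textbody, status = get_plaintext_document_body("/tmp/paper.pdf")
# yields the document lines and status 0 on success, or an empty list and
# status 1 for an unreadable path or an unsupported file format.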
-def build_xml_references(citations):
- """Build marc xml from a references list
-
- Transform the reference elements into marc xml
- """
- xml_references = []
-
- for c in citations:
- # Now, run the method which will take as input:
- # 1. A list of lists of dictionaries, where each dictionary is a piece
- # of citation information corresponding to a tag in the citation.
- # 2. The line marker for this entire citation line (mulitple citation
- # 'finds' inside a single citation will use the same marker value)
- # The resulting xml line will be a properly marked up form of the
- # citation. It will take into account authors to try and split up
- # references which should be read as two SEPARATE ones.
- xml_lines = build_xml_citations(c['elements'],
- c['line_marker'])
- xml_references.extend(xml_lines)
-
- return xml_references
-
-
-def parse_references(reference_lines, recid=1, kbs_files=None):
+def parse_references(reference_lines, recid=None, kbs_files=None):
"""Parse a list of references
Given a list of raw reference lines (list of strings),
output the extracted citations as a MARC-XML record
"""
# RefExtract knowledge bases
kbs = get_kbs(custom_kbs_files=kbs_files)
# Identify journal titles, report numbers, URLs, DOIs, and authors...
- (processed_references, counts, dummy_bad_titles_count) = \
+ processed_references, counts, dummy_bad_titles_count = \
parse_references_elements(reference_lines, kbs)
# Generate marc xml using the elements list
- xml_out = build_xml_references(processed_references)
+ fields = build_references(processed_references)
# Generate the xml string to be outputted
- return create_xml_record(counts, recid, xml_out)
+ return build_record(counts, fields, recid=recid)
diff --git a/modules/docextract/lib/refextract_find.py b/modules/docextract/lib/refextract_find.py
index c49d9f187..f9a3166e3 100644
--- a/modules/docextract/lib/refextract_find.py
+++ b/modules/docextract/lib/refextract_find.py
@@ -1,493 +1,499 @@
# -*- coding: utf-8 -*-
##
## This file is part of Invenio.
## Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010, 2011 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
"""Finding the reference section from the fulltext"""
import re
from invenio.docextract_utils import write_message
from invenio.refextract_re import \
get_reference_section_title_patterns, \
get_reference_line_numeration_marker_patterns, \
regex_match_list, \
get_post_reference_section_title_patterns, \
get_post_reference_section_keyword_patterns, \
re_reference_line_bracket_markers, \
re_reference_line_dot_markers, \
re_reference_line_number_markers, \
re_num
def find_reference_section(docbody):
"""Search in document body for its reference section.
More precisely, find
the first line of the reference section. Effectively, the function starts
at the end of a document and works backwards, line-by-line, looking for
the title of a reference section. It stops when (if) it finds something
that it considers to be the first line of a reference section.
@param docbody: (list) of strings - the full document body.
@return: (dictionary) :
{ 'start_line' : (integer) - index in docbody of 1st reference line,
'title_string' : (string) - title of the reference section.
'marker' : (string) - the marker of the first reference line,
'marker_pattern' : (string) - regexp string used to find the marker,
'title_marker_same_line' : (integer) - flag to indicate whether the
reference section title was on the same
line as the first reference line's
marker or not. 1 if it was; 0 if not.
}
Much of this information is used by later functions to rebuild
a reference section.
-- OR --
(None) - when the reference section could not be found.
"""
ref_details = None
title_patterns = get_reference_section_title_patterns()
# Try to find refs section title:
for reversed_index, line in enumerate(reversed(docbody)):
title_match = regex_match_list(line, title_patterns)
if title_match:
title = title_match.group('title')
index = len(docbody) - 1 - reversed_index
temp_ref_details, found_title = find_numeration(docbody[index:index+3], title)
if temp_ref_details:
if ref_details and 'title' in ref_details \
and ref_details['title'] \
and not temp_ref_details['title']:
continue
if ref_details and 'marker' in ref_details \
and ref_details['marker'] \
and not temp_ref_details['marker']:
continue
ref_details = temp_ref_details
ref_details['start_line'] = index
ref_details['title_string'] = title
if found_title:
break
return ref_details
def find_numeration_in_body(docbody):
marker_patterns = get_reference_line_numeration_marker_patterns()
ref_details = None
found_title = False
+ # No numeration unless we find one
+ ref_details = {
+ 'title_marker_same_line': False,
+ 'marker': None,
+ 'marker_pattern': None,
+ }
+
for line in docbody:
# Move past blank lines
if line.isspace():
continue
# Is this line numerated like a reference line?
+ m_num = None
mark_match = regex_match_list(line, marker_patterns)
if mark_match:
+ # Check if it's the first reference
+ # Something like [1] or (1), etc.
+ try:
+ m_num = mark_match.group('marknum')
+ if m_num != '1':
+ continue
+ except IndexError:
+ pass
+
mark = mark_match.group('mark')
mk_ptn = mark_match.re.pattern
ref_details = {
'marker': mark,
'marker_pattern': mk_ptn,
'title_marker_same_line': False,
}
- # Check if it's the first reference
- # Something like [1] or (1), etc.
- m_num = re_num.search(mark)
- if m_num and m_num.group(0) == '1':
- # 1st ref truly found
- break
- else:
- # No numeration
- ref_details = {
- 'title_marker_same_line': False,
- 'marker': None,
- 'marker_pattern': None,
- }
+
+ break
return ref_details, found_title
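# Editorial sketch: for a docbody whose first non-blank line starts with
# "[1] ", this returns a ref_details dictionary whose 'marker' is "[1]"
# and whose 'marker_pattern' is the matching regexp, assuming the
# bracketed-number style is among the recognised marker patterns.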
def find_numeration_in_title(docbody, title):
ref_details = None
found_title = False
try:
first_line = docbody[0]
except IndexError:
return ref_details, found_title
# Need to escape to avoid problems like 'References['
title = re.escape(title)
mk_with_title_ptns = \
get_reference_line_numeration_marker_patterns(title)
mk_with_title_match = \
regex_match_list(first_line, mk_with_title_ptns)
if mk_with_title_match:
mk = mk_with_title_match.group('mark')
mk_ptn = mk_with_title_match.re.pattern
m_num = re_num.search(mk)
if m_num and m_num.group(0) == '1':
# Mark found
found_title = True
ref_details = {
'marker': mk,
'marker_pattern': mk_ptn,
'title_marker_same_line': True
}
else:
ref_details = {
'marker': mk,
'marker_pattern': mk_ptn,
'title_marker_same_line': True
}
return ref_details, found_title
def find_numeration(docbody, title):
"""Find numeration pattern
1st try to find numeration in the title
e.g.
References [4] Riotto...
2nd find the numeration alone in the line after the title
e.g.
References
1
Riotto
3rd, find the numeration in the following line
e.g.
References
[1] Riotto
"""
ref_details, found_title = find_numeration_in_title(docbody, title)
if not ref_details:
ref_details, found_title = find_numeration_in_body(docbody)
return ref_details, found_title
def find_reference_section_no_title_via_brackets(docbody):
"""This function would generally be used when it was not possible to locate
the start of a document's reference section by means of its title.
Instead, this function will look for reference lines that have numeric
markers of the format [1], [2], etc.
@param docbody: (list) of strings -each string is a line in the document.
@return: (dictionary) :
{ 'start_line' : (integer) - index in docbody of 1st reference line,
'title_string' : (None) - title of the reference section
(None since no title),
'marker' : (string) - the marker of the first reference line,
'marker_pattern' : (string) - the regexp string used to find the
marker,
'title_marker_same_line' : (integer) 0 - to signal title not on same
line as marker.
}
Much of this information is used by later functions to rebuild
a reference section.
-- OR --
(None) - when the reference section could not be found.
"""
marker_patterns = [re_reference_line_bracket_markers]
return find_reference_section_no_title_generic(docbody, marker_patterns)
def find_reference_section_no_title_via_dots(docbody):
"""This function would generally be used when it was not possible to locate
the start of a document's reference section by means of its title.
Instead, this function will look for reference lines that have numeric
markers of the format 1., 2., etc.
@param docbody: (list) of strings -each string is a line in the document.
@return: (dictionary) :
{ 'start_line' : (integer) - index in docbody of 1st reference line,
'title_string' : (None) - title of the reference section
(None since no title),
'marker' : (string) - the marker of the first reference line,
'marker_pattern' : (string) - the regexp string used to find the
marker,
'title_marker_same_line' : (integer) 0 - to signal title not on same
line as marker.
}
Much of this information is used by later functions to rebuild
a reference section.
-- OR --
(None) - when the reference section could not be found.
"""
marker_patterns = [re_reference_line_dot_markers]
return find_reference_section_no_title_generic(docbody, marker_patterns)
def find_reference_section_no_title_via_numbers(docbody):
"""This function would generally be used when it was not possible to locate
the start of a document's reference section by means of its title.
Instead, this function will look for reference lines that have numeric
markers of the format 1, 2, etc.
@param docbody: (list) of strings -each string is a line in the document.
@return: (dictionary) :
{ 'start_line' : (integer) - index in docbody of 1st reference line,
'title_string' : (None) - title of the reference section
(None since no title),
'marker' : (string) - the marker of the first reference line,
'marker_pattern' : (string) - the regexp string used to find the
marker,
'title_marker_same_line' : (integer) 0 - to signal title not on same
line as marker.
}
Much of this information is used by later functions to rebuild
a reference section.
-- OR --
(None) - when the reference section could not be found.
"""
marker_patterns = [re_reference_line_number_markers]
return find_reference_section_no_title_generic(docbody, marker_patterns)
def find_reference_section_no_title_generic(docbody, marker_patterns):
"""This function would generally be used when it was not possible to locate
the start of a document's reference section by means of its title.
Instead, this function will look for reference lines that have numeric
markers of the format [1], [2], {1}, {2}, etc.
@param docbody: (list) of strings -each string is a line in the document.
@return: (dictionary) :
{ 'start_line' : (integer) - index in docbody of 1st reference line,
'title_string' : (None) - title of the reference section
(None since no title),
'marker' : (string) - the marker of the first reference line,
'marker_pattern' : (string) - the regexp string used to find the
marker,
'title_marker_same_line' : (integer) 0 - to signal title not on same
line as marker.
}
Much of this information is used by later functions to rebuild
a reference section.
-- OR --
(None) - when the reference section could not be found.
"""
if not docbody:
return None
ref_start_line = ref_line_marker = None
# try to find first reference line in the reference section:
found_ref_sect = False
for reversed_index, line in enumerate(reversed(docbody)):
mark_match = regex_match_list(line.strip(), marker_patterns)
if mark_match and mark_match.group('marknum') == '1':
# Get marker recognition pattern:
mark_pattern = mark_match.re.pattern
# Look for [2] in next 10 lines:
next_test_lines = 10
index = len(docbody) - reversed_index
zone_to_check = docbody[index:index+next_test_lines]
if len(zone_to_check) < 5:
# We found a 1 towards the end, so we assume
# the document has only one reference
found = True
else:
# Check for number 2
found = False
for l in zone_to_check:
mark_match2 = regex_match_list(l.strip(), marker_patterns)
if mark_match2 and mark_match2.group('marknum') == '2':
found = True
break
if found:
# Found next reference line:
found_ref_sect = True
ref_start_line = len(docbody) - 1 - reversed_index
ref_line_marker = mark_match.group('mark')
ref_line_marker_pattern = mark_pattern
break
if found_ref_sect:
ref_sectn_details = {
'start_line' : ref_start_line,
'title_string' : None,
'marker' : ref_line_marker.strip(),
'marker_pattern' : ref_line_marker_pattern,
'title_marker_same_line' : False,
}
else:
# didn't manage to find the reference section
ref_sectn_details = None
return ref_sectn_details
def find_end_of_reference_section(docbody,
ref_start_line,
ref_line_marker,
ref_line_marker_ptn):
"""Given that the start of a document's reference section has already been
recognised, this function is tasked with finding the line-number in the
document of the last line of the reference section.
@param docbody: (list) of strings - the entire plain-text document body.
@param ref_start_line: (integer) - the index in docbody of the first line
of the reference section.
@param ref_line_marker: (string) - the line marker of the first reference
line.
@param ref_line_marker_ptn: (string) - the pattern used to search for a
reference line marker.
@return: (integer) - index in docbody of the last reference line
-- OR --
(None) - if ref_start_line was invalid.
"""
section_ended = False
x = ref_start_line
if type(x) is not int or x < 0 or \
x > len(docbody) or len(docbody) < 1:
# The provided 'first line' of the reference section was invalid.
# Either it was out of bounds in the document body, or it was not a
# valid integer.
# Can't safely find end of refs with this info - quit.
return None
# Get patterns for testing line:
t_patterns = get_post_reference_section_title_patterns()
kw_patterns = get_post_reference_section_keyword_patterns()
if None not in (ref_line_marker, ref_line_marker_ptn):
mk_patterns = [re.compile(ref_line_marker_ptn, re.I|re.UNICODE)]
else:
mk_patterns = get_reference_line_numeration_marker_patterns()
current_reference_count = 0
while x < len(docbody) and not section_ended:
# save the reference count
num_match = regex_match_list(docbody[x].strip(), mk_patterns)
if num_match:
try:
current_reference_count = int(num_match.group('marknum'))
except (ValueError, IndexError):
# non numerical references marking
pass
# look for a likely section title that would follow a reference section:
end_match = regex_match_list(docbody[x].strip(), t_patterns)
if not end_match:
# didn't match a section title - try looking for keywords that
# suggest the end of a reference section:
end_match = regex_match_list(docbody[x].strip(), kw_patterns)
else:
# Is it really the end of the reference section? Check the following
# lines for other reference numeration markers:
y = x + 1
line_found = False
while y < x + 200 and y < len(docbody) and not line_found:
num_match = regex_match_list(docbody[y].strip(), mk_patterns)
if num_match and not num_match.group(0).isdigit():
try:
num = int(num_match.group('marknum'))
if current_reference_count + 1 == num:
line_found = True
except ValueError:
# We have the marknum index so it is
# numeric pattern for references like
# [1], [2] but this match is not a number
pass
except IndexError:
# We have a non numerical references marking
# we don't check for a number continuity
line_found = True
y += 1
if not line_found:
# No ref line found-end section
section_ended = True
if not section_ended:
# Do this and the following lines contain only numbers? If so, it's
# probably the axis scale of a graph in a fig. End refs section
digit_test_str = docbody[x].replace(" ", "").\
replace(".", "").\
replace("-", "").\
replace("+", "").\
replace(u"\u00D7", "").\
replace(u"\u2212", "").\
strip()
if len(digit_test_str) > 10 and digit_test_str.isdigit():
# The line contains only digits and is longer than 10 chars:
y = x + 1
digit_lines = 4
num_digit_lines = 1
while y < x + digit_lines and y < len(docbody):
digit_test_str = docbody[y].replace(" ", "").\
replace(".", "").\
replace("-", "").\
replace("+", "").\
replace(u"\u00D7", "").\
replace(u"\u2212", "").\
strip()
if len(digit_test_str) > 10 and digit_test_str.isdigit():
num_digit_lines += 1
elif len(digit_test_str) == 0:
# This is a blank line. Don't count it, to accommodate
# documents that are double-line spaced:
digit_lines += 1
y = y + 1
if num_digit_lines == digit_lines:
section_ended = True
x += 1
return x - 1
def get_reference_section_beginning(fulltext):
sect_start = {'start_line' : None,
'end_line' : None,
'title_string' : None,
'marker_pattern' : None,
'marker' : None,
'how_found_start': None,
}
## Find start of refs section:
sect_start = find_reference_section(fulltext)
if sect_start is not None:
sect_start['how_found_start'] = 1
else:
## No references found - try with no title option
sect_start = find_reference_section_no_title_via_brackets(fulltext)
if sect_start is not None:
sect_start['how_found_start'] = 2
## Try weaker set of patterns if needed
if sect_start is None:
## No references found - try with no title option (with weaker patterns..)
sect_start = find_reference_section_no_title_via_dots(fulltext)
if sect_start is not None:
sect_start['how_found_start'] = 3
if sect_start is None:
## No references found - try with no title option (with even weaker patterns..)
sect_start = find_reference_section_no_title_via_numbers(fulltext)
if sect_start is not None:
sect_start['how_found_start'] = 4
if sect_start:
write_message('* title %r' % sect_start['title_string'], verbose=3)
write_message('* marker %r' % sect_start['marker'], verbose=3)
write_message('* title_marker_same_line %s' \
% sect_start['title_marker_same_line'], verbose=3)
else:
write_message('* could not find references section', verbose=3)
return sect_start
diff --git a/modules/docextract/lib/refextract_kbs.py b/modules/docextract/lib/refextract_kbs.py
index 277eba829..aa4af2338 100644
--- a/modules/docextract/lib/refextract_kbs.py
+++ b/modules/docextract/lib/refextract_kbs.py
@@ -1,744 +1,757 @@
# -*- coding: utf-8 -*-
##
## This file is part of Invenio.
## Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010, 2011 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
import re
import sys
import csv
try:
import hashlib
md5 = hashlib.md5
except ImportError:
from md5 import new as md5
from invenio.refextract_config import CFG_REFEXTRACT_KBS
from invenio.bibknowledge import get_kbr_items
from invenio.config import CFG_REFEXTRACT_KBS_OVERRIDE
from invenio.refextract_re import re_kb_line, \
re_regexp_character_class, \
re_report_num_chars_to_escape, \
re_extract_quoted_text, \
re_extract_char_class, \
re_punctuation
from invenio.docextract_utils import write_message
from invenio.docextract_text import re_group_captured_multiple_space
def get_kbs(custom_kbs_files=None, cache=None):
"""Load kbs (with caching)
This function stores the loaded kbs into the cache variable
For the caching to work, it needs to receive an empty dictionary
as "cache" paramater.
"""
if not cache:
cache = {}
cache_key = make_cache_key(custom_kbs_files)
if cache_key not in cache:
# Build paths from defaults and specified ones
kbs_files = CFG_REFEXTRACT_KBS.copy()
for key, path in CFG_REFEXTRACT_KBS_OVERRIDE.items():
kbs_files[key] = path
if custom_kbs_files:
for key, path in custom_kbs_files.items():
if path:
kbs_files[key] = path
# Loads kbs from those paths
cache[cache_key] = load_kbs(kbs_files)
return cache[cache_key]
def load_kbs(kbs_files):
"""Load kbs (without caching)
Args:
- kbs_files: dictionary of custom paths you can specify to override
the default values
If a path starts with "kb:", the kb will be loaded from the database
"""
return {
'journals_re': build_journals_re_kb(kbs_files['journals-re']),
'journals': load_kb(kbs_files['journals'], build_journals_kb),
'report-numbers': build_reportnum_kb(kbs_files['report-numbers']),
'authors': build_authors_kb(kbs_files['authors']),
'books': build_books_kb(kbs_files['books']),
'publishers': load_kb(kbs_files['publishers'], build_publishers_kb),
'special_journals': build_special_journals_kb(kbs_files['special-journals']),
+ 'collaborations': load_kb(kbs_files['collaborations'], build_collaborations_kb),
}
def load_kb(path, builder):
try:
path.startswith
except AttributeError:
write_message("Loading kb from array", verbose=3)
return load_kb_from_iterable(path, builder)
else:
write_message("Loading kb from %s" % path, verbose=3)
kb_start = 'kb:'
if path.startswith(kb_start):
return load_kb_from_db(path[len(kb_start):], builder)
else:
return load_kb_from_file(path, builder)
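# Editorial usage sketch: both forms below are supported; the first loads
# from a file on disk, the second (with the "kb:" prefix) from a
# bibknowledge kb in the database. Path and kb name are hypothetical:
#     journals = load_kb("/opt/invenio/etc/journal-titles.kb", build_journals_kb)
#     journals = load_kb("kb:JOURNALS", build_journals_kb)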
def make_cache_key(custom_kbs_files=None):
"""Create cache key for kbs caches instances
This function generates a unique key for a given set of arguments.
The files dictionary is transformed like this:
{'journal': '/var/journal.kb', 'books': '/var/books.kb'}
to
"journal=/var/journal.kb;books=/var/books.kb"
Then _inspire is appended if we are an INSPIRE site.
"""
if custom_kbs_files:
serialized_args = ('%s=%s' % v for v in custom_kbs_files.iteritems())
serialized_args = ';'.join(serialized_args)
else:
serialized_args = "default"
cache_key = md5(serialized_args).digest()
return cache_key
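# Example (illustrative): the default key is stable across calls, and a
# custom-files key serializes the mapping before hashing:
#   make_cache_key()  ==  md5("default").digest()
#   make_cache_key({'journals': '/tmp/journals.kb'})
#       ==  md5("journals=/tmp/journals.kb").digest()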
def order_reportnum_patterns_bylen(numeration_patterns):
"""Given a list of user-defined patterns for recognising the numeration
styles of an institute's preprint references, for each pattern,
strip out character classes and record the length of the pattern.
Then add the length and the original pattern (in a tuple) into a new
list for these patterns and return this list.
@param numeration_patterns: (list) of strings, whereby each string is
a numeration pattern.
@return: (list) of tuples, where each tuple contains a pattern and
its length.
"""
def _compfunc_bylen(a, b):
"""Compares regexp patterns by the length of the pattern-text.
"""
if a[0] < b[0]:
return 1
elif a[0] == b[0]:
return 0
else:
return -1
pattern_list = []
for pattern in numeration_patterns:
base_pattern = re_regexp_character_class.sub('1', pattern)
pattern_list.append((len(base_pattern), pattern))
pattern_list.sort(_compfunc_bylen)
return pattern_list
def create_institute_numeration_group_regexp_pattern(patterns):
"""Using a list of regexp patterns for recognising numeration patterns
for institute preprint references, ordered by length - longest to
shortest - create a grouped 'OR' of these patterns, ready to be
used in a bigger regexp.
@param patterns: (list) of strings. All of the numeration regexp
patterns for recognising an institute's preprint reference styles.
@return: (string) a grouped 'OR' regexp pattern of the numeration
patterns. E.g.:
(?P<numn>[12]\d{3} \d\d\d|\d\d \d\d\d|[A-Za-z] \d\d\d)
"""
grouped_numeration_pattern = u""
if len(patterns) > 0:
grouped_numeration_pattern = u"(?P<numn>"
for pattern in patterns:
grouped_numeration_pattern += \
institute_num_pattern_to_regex(pattern[1]) + u"|"
grouped_numeration_pattern = \
grouped_numeration_pattern[0:len(grouped_numeration_pattern) - 1]
grouped_numeration_pattern += u")"
return grouped_numeration_pattern
def institute_num_pattern_to_regex(pattern):
"""Given a numeration pattern from the institutes preprint report
numbers KB, convert it into a regexp string for
recognising such patterns in a reference line.
Change:
\ -> \\
9 -> \d
9+ -> \d+
w+ -> \w+
a -> [A-Za-z]
v -> [Vv] # Tony for arXiv vN
mm -> (0[1-9]|1[0-2])
yyyy -> [12]\d{3}
yy -> \d\d
s -> \s*
/ -> \/
@param pattern: (string) a user-defined preprint reference numeration
pattern.
@return: (string) the regexp for recognising the pattern.
"""
simple_replacements = [
('9', r'\d'),
('9+', r'\d+'),
('w+', r'\w+'),
('a', r'[A-Za-z]'),
('v', r'[Vv]'),
('mm', r'(0[1-9]|1[0-2])'),
('yyyy', r'[12]\d{3}'),
('yy', r'\d\d'),
('s', r'\s*'),
- (r'/', r'\/'),
- ]
+ (r'/', r'\/')]
# first, escape certain characters that could be sensitive to a regexp:
pattern = re_report_num_chars_to_escape.sub(r'\\\g<1>', pattern)
# now loop through and carry out the simple replacements:
for repl in simple_replacements:
pattern = pattern.replace(repl[0], repl[1])
# now replace a couple of regexp-like patterns:
# quoted string with non-quoted version ("hello" with hello);
# Replace / [abcd ]/ with /( [abcd])?/ :
pattern = re_extract_quoted_text[0].sub(re_extract_quoted_text[1],
pattern)
pattern = re_extract_char_class[0].sub(re_extract_char_class[1],
pattern)
# the pattern has been transformed
return pattern
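# Example (illustrative): a KB numeration pattern such as "yyyy-999" first
# has the "-" escaped, then the symbolic tokens replaced:
#   institute_num_pattern_to_regex(u"yyyy-999")  ->  ur"[12]\d{3}\-\d\d\d"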
def build_reportnum_kb(fpath):
"""Given the path to a knowledge base file containing the details
of institutes and the patterns that their preprint report
numbering schemes take, create a dictionary of regexp search
patterns to recognise these preprint references in reference
lines, and a dictionary of replacements for non-standard preprint
categories in these references.
The knowledge base file should consist only of lines that take one
of the following 3 formats:
#####Institute Name#####
(the name of the institute to which the preprint reference patterns
belong, e.g. '#####LANL#####', surrounded by 5 # on either side.)
<pattern>
(numeration patterns for an institute's preprints, surrounded by
< and >.)
seek-term --- replace-term
(i.e. a seek phrase on the left hand side, a replace phrase on the
right hand side, with the two phrases being separated by 3 hyphens.)
E.g.:
ASTRO PH ---astro-ph
The left-hand side term is a non-standard version of the preprint
reference category; the right-hand side term is the standard version.
If the KB file cannot be read from, or an unexpected line is
encountered in the KB, an error message is output to standard error
and an exception is raised.
@param fpath: (string) the path to the knowledge base file.
@return: (tuple) containing 2 dictionaries. The first contains regexp
search patterns used to identify preprint references in a line. This
dictionary is keyed by a tuple containing the line number of the
pattern in the KB and the non-standard category string.
E.g.: (3, 'ASTRO PH').
The second dictionary contains the standardised category string,
and is keyed by the non-standard category string. E.g.: 'astro-ph'.
"""
def _add_institute_preprint_patterns(preprint_classifications,
preprint_numeration_ptns,
preprint_reference_search_regexp_patterns,
standardised_preprint_reference_categories,
kb_line_num):
"""For a list of preprint category strings and preprint numeration
patterns for a given institute, create the regexp patterns for
each of the preprint types. Add the regexp patterns to the
dictionary of search patterns
(preprint_reference_search_regexp_patterns), keyed by the line
number of the institute in the KB, and the preprint category
search string. Also add the standardised preprint category string
to another dictionary, keyed by the line number of its position
in the KB and its non-standardised version.
@param preprint_classifications: (list) of tuples whereby each tuple
contains a preprint category search string and the line number of
the name of institute to which it belongs in the KB.
E.g.: (45, 'ASTRO PH').
@param preprint_numeration_ptns: (list) of preprint reference
numeration search patterns (strings)
@param preprint_reference_search_regexp_patterns: (dictionary) of
regexp patterns used to search in document lines.
@param standardised_preprint_reference_categories: (dictionary)
containing the standardised strings for preprint reference
categories. (E.g. 'astro-ph'.)
@param kb_line_num: (integer) - the line number in the KB at
which a given institute name was found.
@return: None
"""
if preprint_classifications and preprint_numeration_ptns:
# the previous institute had both numeration styles and categories
# for preprint references.
# build regexps and add them for this institute:
# First, order the numeration styles by line-length, and build a
# grouped regexp for recognising numeration:
ordered_patterns = \
order_reportnum_patterns_bylen(preprint_numeration_ptns)
# create a grouped regexp for numeration part of
# preprint reference:
numeration_regexp = \
create_institute_numeration_group_regexp_pattern(ordered_patterns)
# for each "classification" part of preprint references, create a
# complete regex:
# will be in the style "(categ)-(numatn1|numatn2|numatn3|...)"
for classification in preprint_classifications:
search_pattern_str = ur'(?:^|[^a-zA-Z0-9\/\.\-])((?P<categ>' \
+ classification[0].strip() + u')' \
+ numeration_regexp + u')'
re_search_pattern = re.compile(search_pattern_str,
re.UNICODE)
preprint_reference_search_regexp_patterns[(kb_line_num,
classification[0])] =\
re_search_pattern
standardised_preprint_reference_categories[(kb_line_num,
classification[0])] =\
classification[1]
preprint_reference_search_regexp_patterns = {} # a dictionary of patterns
# used to recognise
# categories of preprints
# as used by various
# institutes
standardised_preprint_reference_categories = {} # dictionary of
# standardised category
# strings for preprint cats
current_institute_preprint_classifications = [] # list of tuples containing
# preprint categories in
# their raw & standardised
# forms, as read from KB
current_institute_numerations = [] # list of preprint
# numeration patterns, as
# read from the KB
# pattern to recognise an institute name line in the KB
re_institute_name = re.compile(ur'^\*{5}\s*(.+)\s*\*{5}$', re.UNICODE)
# pattern to recognise an institute preprint categ line in the KB
re_preprint_classification = \
re.compile(ur'^\s*(\w.*)\s*---\s*(\w.*)\s*$', re.UNICODE)
# pattern to recognise a preprint numeration-style line in KB
re_numeration_pattern = re.compile(ur'^\<(.+)\>$', re.UNICODE)
kb_line_num = 0 # when making the dictionary of patterns, which is
# keyed by the category search string, this counter
# will ensure that patterns in the dictionary are not
# overwritten if 2 institutes have the same category
# styles.
try:
if isinstance(fpath, basestring):
write_message('Loading reports kb from %s' % fpath, verbose=3)
fh = open(fpath, "r")
fpath_needs_closing = True
else:
fpath_needs_closing = False
fh = fpath
for rawline in fh:
if rawline.startswith('#'):
continue
kb_line_num += 1
try:
rawline = rawline.decode("utf-8")
except UnicodeError:
- write_message("*** Unicode problems in %s for line %e" \
+ write_message("*** Unicode problems in %s for line %s"
% (fpath, kb_line_num), sys.stderr, verbose=0)
raise UnicodeError("Error: Unable to parse report number kb (line: %s)" % str(kb_line_num))
m_institute_name = re_institute_name.search(rawline)
if m_institute_name:
# This KB line is the name of an institute
# append the last institute's pattern list to the list of
# institutes:
_add_institute_preprint_patterns(current_institute_preprint_classifications,
current_institute_numerations,
preprint_reference_search_regexp_patterns,
standardised_preprint_reference_categories,
kb_line_num)
# Now start a new dictionary to contain the search patterns
# for this institute:
current_institute_preprint_classifications = []
current_institute_numerations = []
# move on to the next line
continue
m_preprint_classification = \
re_preprint_classification.search(rawline)
if m_preprint_classification:
# This KB line contains a preprint classification for
# the current institute
try:
current_institute_preprint_classifications.append((m_preprint_classification.group(1),
m_preprint_classification.group(2)))
except (AttributeError, NameError):
# didn't match this line correctly - skip it
pass
# move on to the next line
continue
m_numeration_pattern = re_numeration_pattern.search(rawline)
if m_numeration_pattern:
# This KB line contains a preprint item numeration pattern
# for the current institute
try:
current_institute_numerations.append(m_numeration_pattern.group(1))
except (AttributeError, NameError):
# didn't match the numeration pattern correctly - skip it
pass
continue
_add_institute_preprint_patterns(current_institute_preprint_classifications,
current_institute_numerations,
preprint_reference_search_regexp_patterns,
standardised_preprint_reference_categories,
kb_line_num)
if fpath_needs_closing:
write_message('Loaded reports kb', verbose=3)
fh.close()
except IOError:
# problem opening KB for reading, or problem while reading from it:
emsg = """Error: Could not build knowledge base containing """ \
"""institute preprint referencing patterns - failed """ \
"""to read from KB %(kb)s.""" \
% {'kb' : fpath}
write_message(emsg, sys.stderr, verbose=0)
raise IOError("Error: Unable to open report number kb '%s'" % fpath)
# return the preprint reference patterns and the replacement strings
# for non-standard categ-strings:
- return (preprint_reference_search_regexp_patterns, \
+ return (preprint_reference_search_regexp_patterns,
standardised_preprint_reference_categories)
def _cmp_bystrlen_reverse(a, b):
"""A private "cmp" function to be used by the "sort" function of a
list when ordering the titles found in a knowledge base by string-
length - LONGEST -> SHORTEST.
@param a: (string)
@param b: (string)
@return: (integer) - 0 if len(a) == len(b); 1 if len(a) < len(b);
-1 if len(a) > len(b);
"""
if len(a) > len(b):
return -1
elif len(a) < len(b):
return 1
else:
return 0
def build_special_journals_kb(fpath):
"""Load special journals database from file
Special journals are journals that have a volume which is not unique
among different years. To keep the volume unique we are adding the year
before the volume.
"""
journals = set()
write_message('Loading special journals kb from %s' % fpath, verbose=3)
fh = open(fpath, "r")
try:
for line in fh:
# Skip commented lines
if line.startswith('#'):
continue
# Skip empty line
if not line.strip():
continue
journals.add(line.strip())
finally:
fh.close()
write_message('Loaded special journals kb', verbose=3)
return journals
def build_books_kb(fpath):
if isinstance(fpath, basestring):
fpath_needs_closing = True
try:
write_message('Loading books kb from %s' % fpath, verbose=3)
fh = open(fpath, "r")
source = csv.reader(fh, delimiter='|', lineterminator=';')
except IOError:
# problem opening KB for reading, or problem while reading from it:
emsg = "Error: Could not build list of books - failed " \
"to read from KB %(kb)s." % {'kb' : fpath}
raise IOError(emsg)
else:
fpath_needs_closing = False
source = fpath
try:
books = {}
for line in source:
try:
books[line[1].upper()] = line
except IndexError:
write_message('Invalid line in books kb %s' % line, verbose=1)
finally:
if fpath_needs_closing:
fh.close()
write_message('Loaded books kb', verbose=3)
return books
def build_publishers_kb(fpath):
if isinstance(fpath, basestring):
fpath_needs_closing = True
try:
write_message('Loading publishers kb from %s' % fpath, verbose=3)
fh = open(fpath, "r")
source = csv.reader(fh, delimiter='|', lineterminator='\n')
except IOError:
# problem opening KB for reading, or problem while reading from it:
emsg = "Error: Could not build list of publishers - failed " \
"to read from KB %(kb)s." % {'kb' : fpath}
raise IOError(emsg)
else:
fpath_needs_closing = False
source = fpath
try:
publishers = {}
for line in source:
try:
- publishers[line[0]] = line[1]
+ pattern = re.compile(ur'(\b|^)%s(\b|$)' % line[0], re.I|re.U)
+ publishers[line[0]] = {'pattern': pattern, 'repl': line[1]}
except IndexError:
write_message('Invalid line in publishers kb %s' % line, verbose=1)
finally:
if fpath_needs_closing:
fh.close()
write_message('Loaded publishers kb', verbose=3)
return publishers
def build_authors_kb(fpath):
replacements = []
if isinstance(fpath, basestring):
fpath_needs_closing = True
try:
fh = open(fpath, "r")
except IOError:
# problem opening KB for reading, or problem while reading from it:
emsg = "Error: Could not build list of authors - failed " \
"to read from KB %(kb)s." % {'kb' : fpath}
write_message(emsg, sys.stderr, verbose=0)
raise IOError("Error: Unable to open authors kb '%s'" % fpath)
else:
fpath_needs_closing = False
fh = fpath
try:
for rawline in fh:
if rawline.startswith('#'):
continue
# Extract the seek->replace terms from this KB line:
m_kb_line = re_kb_line.search(rawline.decode('utf-8'))
if m_kb_line:
seek = m_kb_line.group('seek')
repl = m_kb_line.group('repl')
replacements.append((seek, repl))
finally:
if fpath_needs_closing:
fh.close()
return replacements
def build_journals_re_kb(fpath):
"""Load journals regexps knowledge base
@see build_journals_kb
"""
def make_tuple(match):
- regexp = re.compile(match.group('seek'), re.UNICODE)
- repl = '<cds.JOURNAL>%s</cds.JOURNAL>' % match.group('repl')
- return (regexp, repl)
+ regexp = match.group('seek')
+ repl = match.group('repl')
+ return regexp, repl
kb = []
if isinstance(fpath, basestring):
fpath_needs_closing = True
try:
fh = open(fpath, "r")
except IOError:
raise IOError("Error: Unable to open journal kb '%s'" % fpath)
else:
fpath_needs_closing = False
fh = fpath
try:
for rawline in fh:
if rawline.startswith('#'):
continue
# Extract the seek->replace terms from this KB line:
m_kb_line = re_kb_line.search(rawline.decode('utf-8'))
kb.append(make_tuple(m_kb_line))
finally:
if fpath_needs_closing:
fh.close()
return kb
def load_kb_from_iterable(kb, builder):
return builder(kb)
def load_kb_from_file(path, builder):
try:
fh = open(path, "r")
except IOError, e:
raise StandardError("Unable to open kb '%s': %s" % (path, e))
def lazy_parser(fh):
for rawline in fh:
if rawline.startswith('#'):
continue
try:
rawline = rawline.decode("utf-8").rstrip("\n")
except UnicodeError:
- raise StandardError("Unicode problems in kb %s at line %s" \
+ raise StandardError("Unicode problems in kb %s at line %s"
% (path, rawline))
# Test line to ensure that it is a correctly formatted
# knowledge base line:
# Extract the seek->replace terms from this KB line
m_kb_line = re_kb_line.search(rawline)
if m_kb_line: # good KB line
yield m_kb_line.group('seek'), m_kb_line.group('repl')
else:
- raise StandardError("Badly formatted kb '%s' at line %s" \
+ raise StandardError("Badly formatted kb '%s' at line %s"
% (path, rawline))
try:
return builder(lazy_parser(fh))
finally:
fh.close()
def load_kb_from_db(kb_name, builder):
def lazy_parser(kb):
for mapping in kb:
yield mapping['key'], mapping['value']
return builder(lazy_parser(get_kbr_items(kb_name)))
def build_journals_kb(knowledgebase):
"""Given the path to a knowledge base file, read in the contents
of that file into a dictionary of search->replace word phrases.
The search phrases are compiled into a regex pattern object.
The knowledge base file should consist only of lines that take
the following format:
seek-term --- replace-term
(i.e. a seek phrase on the left hand side, a replace phrase on
the right hand side, with the two phrases being separated by 3
hyphens.) E.g.:
ASTRONOMY AND ASTROPHYSICS ---Astron. Astrophys.
The left-hand side term is a non-standard version of the title,
whereas the right-hand side term is the standard version.
If the KB file cannot be read from, or an unexpected line is
encountered in the KB, an error
message is output to standard error and execution is halted with
an error-code 0.
@param fpath: (string) the path to the knowledge base file.
@return: (tuple) containing a list and a dictionary. The list
contains compiled regex patterns used as search terms and will
be used to force searching order to match that of the knowledge
base.
The dictionary contains the search->replace terms. The keys of
the dictionary are the compiled regex word phrases used for
searching in the reference lines; The values in the dictionary are
the replace terms for matches.
"""
# Initialise vars:
# dictionary of search and replace phrases from KB:
kb = {}
standardised_titles = {}
seek_phrases = []
# A dictionary of "replacement terms" (RHS) to be inserted into KB as
# "seek terms" later, if they were not already explicitly added
# by the KB:
repl_terms = {}
write_message('Processing journals kb', verbose=3)
for seek_phrase, repl in knowledgebase:
# good KB line
# Add the 'replacement term' into the dictionary of
# replacement terms:
repl_terms[repl] = None
# add the phrase from the KB if the 'seek' phrase is longer
# compile the seek phrase into a pattern:
seek_ptn = re.compile(ur'(?<!\w)(%s)\W' % re.escape(seek_phrase),
re.UNICODE)
kb[seek_phrase] = seek_ptn
standardised_titles[seek_phrase] = repl
seek_phrases.append(seek_phrase)
# Now, for every 'replacement term' found in the KB, if it is
# not already in the KB as a "search term", add it:
for repl_term in repl_terms.keys():
raw_repl_phrase = repl_term.upper()
raw_repl_phrase = re_punctuation.sub(u' ', raw_repl_phrase)
raw_repl_phrase = \
re_group_captured_multiple_space.sub(u' ', raw_repl_phrase)
raw_repl_phrase = raw_repl_phrase.strip()
if raw_repl_phrase not in kb:
# The replace-phrase was not in the KB as a seek phrase
# It should be added.
- seek_ptn = re.compile(r'(?<!\/)\b(' + \
- re.escape(raw_repl_phrase) + \
- r')[^A-Z0-9]', re.UNICODE)
+ pattern = ur'(?<!\/)\b(%s)[^A-Z0-9]' % re.escape(raw_repl_phrase)
+ seek_ptn = re.compile(pattern, re.U)
kb[raw_repl_phrase] = seek_ptn
standardised_titles[raw_repl_phrase] = repl_term
seek_phrases.append(raw_repl_phrase)
# Sort the titles by string length (long - short)
seek_phrases.sort(_cmp_bystrlen_reverse)
write_message('Processed journals kb', verbose=3)
# return the raw knowledge base:
- return (kb, standardised_titles, seek_phrases)
+ return kb, standardised_titles, seek_phrases
+
+
+def build_collaborations_kb(knowledgebase):
+ kb = {}
+ for pattern, collab in knowledgebase:
+ prefix = ur"(?:^|[\(\"\[\s]|(?<=\W))\s*(?:(?:the|and)\s+)?"
+ collaboration_pattern = ur"(?:\s*coll(?:aborations?|\.)?)?"
+ suffix = ur"(?=$|[><\]\)\"\s.,:])"
+ pattern = pattern.replace(' ', ur'\s')
+ pattern = pattern.replace('Collaboration', collaboration_pattern)
+ re_pattern = "%s(%s)%s" % (prefix, pattern, suffix)
+ kb[collab] = re.compile(re_pattern, re.I|re.U)
+ return kb
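+
+
+# Example (illustrative): a kb entry mapping the pattern
+# u"ATLAS Collaboration" to the value u"ATLAS" yields a regexp that
+# matches e.g. "the ATLAS Collaboration" or "ATLAS Coll.", with group 1
+# capturing the matched text:
+#   kb = build_collaborations_kb([(u'ATLAS Collaboration', u'ATLAS')])
+#   kb[u'ATLAS'].search(u'the ATLAS Collaboration').group(1)
+#       ->  u'ATLAS Collaboration'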
diff --git a/modules/docextract/lib/refextract_linker.py b/modules/docextract/lib/refextract_linker.py
index 05327c618..c455c1954 100644
--- a/modules/docextract/lib/refextract_linker.py
+++ b/modules/docextract/lib/refextract_linker.py
@@ -1,75 +1,79 @@
# -*- coding: utf-8 -*-
##
## This file is part of Invenio.
## Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010, 2011 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
-from invenio.bibrank_citation_indexer import INTBITSET_OF_DELETED_RECORDS
+from invenio.bibrank_citation_indexer import INTBITSET_OF_DELETED_RECORDS, \
+ standardize_report_number
from invenio.bibindex_tokenizers.BibIndexJournalTokenizer import \
CFG_JOURNAL_PUBINFO_STANDARD_FORM
from invenio.search_engine import search_pattern
def get_recids_matching_query(pvalue, fvalue):
"""Return list of recIDs matching query for PVALUE and FVALUE."""
- recids = search_pattern(p=pvalue, f=fvalue, m='e')
+ recids = search_pattern(p=pvalue.encode('utf-8'), f=fvalue, m='e')
recids -= INTBITSET_OF_DELETED_RECORDS
return list(recids)
def format_journal(format_string, mappings):
"""format the publ infostring according to the format"""
def replace(char, data):
return data.get(char, char)
- return ''.join(replace(c, mappings) for c in format_string)
+ for c in mappings.keys():
+ format_string = format_string.replace(c, replace(c, mappings))
+
+ return format_string
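+# Example (illustrative), assuming a publication-info standard form of
+# "773__p,773__v,773__c":
+#   format_journal('773__p,773__v,773__c',
+#       {'773__p': 'Phys.Lett.', '773__v': 'B443',
+#        '773__c': '255', '773__y': '1998'})
+#   ->  'Phys.Lett.,B443,255'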
def find_journal(citation_element):
tags_values = {
'773__p': citation_element['title'],
'773__v': citation_element['volume'],
'773__c': citation_element['page'],
'773__y': citation_element['year'],
}
journal_string \
= format_journal(CFG_JOURNAL_PUBINFO_STANDARD_FORM, tags_values)
return get_recids_matching_query(journal_string, 'journal')
def find_reportnumber(citation_element):
- reportnumber_string = citation_element['report_num']
- return get_recids_matching_query(reportnumber_string, 'reportnumber')
+ reportnumber = standardize_report_number(citation_element['report_num'])
+ return get_recids_matching_query(reportnumber, 'reportnumber')
def find_doi(citation_element):
doi_string = citation_element['doi_string']
return get_recids_matching_query(doi_string, 'doi')
def find_referenced_recid(citation_element):
el_type = citation_element['type']
if el_type in FINDERS:
return FINDERS[el_type](citation_element)
return []
FINDERS = {
'JOURNAL': find_journal,
'REPORTNUMBER': find_reportnumber,
'DOI': find_doi,
}
diff --git a/modules/docextract/lib/refextract_re.py b/modules/docextract/lib/refextract_re.py
index 211118e89..cbf93220b 100644
--- a/modules/docextract/lib/refextract_re.py
+++ b/modules/docextract/lib/refextract_re.py
@@ -1,765 +1,841 @@
# -*- coding: utf-8 -*-
##
## This file is part of Invenio.
## Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010, 2011 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
import re
from datetime import datetime
+# Sep
+re_sep = ur"\s*[,\s:-]\s*"
+# Sep or no sep
+re_sep_opt = ur"\s*[,\s:-]?\s*"
+
# Pattern for PoS journal
# e.g. 2006
re_pos_year_num = ur'(?:19|20)\d{2}'
re_pos_year = ur'(?P<year>(' \
+ ur'\s' + re_pos_year_num + ur'\s' \
+ ur'|' \
+ ur'\(' + re_pos_year_num + '\)' \
+ ur'))'
# e.g. LAT2007
-re_pos_volume = ur'(?P<volume>\w{1,10}(?:19|20)\d{2})'
+re_pos_volume = ur'(?P<volume_name>\w{1,10})' + re_sep_opt + ur'(?P<volume_num>(?:19|20)\d{2})'
# e.g. (LAT2007)
re_pos_volume_par = ur'\(' + re_pos_volume + ur'\)'
# e.g. 20
re_pos_page = ur'(?P<page>\d{1,4})'
re_pos_title = ur'POS'
re_pos_patterns = [
- re_pos_title + ur'\s*' + re_pos_year + ur'\s+' + re_pos_volume + ur'\s+' + re_pos_page,
- re_pos_title + ur'\s+' + re_pos_volume + ur'\s*' + re_pos_year + ur'\s*' + re_pos_page,
- re_pos_title + ur'\s+' + re_pos_volume + ur'\s+' + re_pos_page + ur'\s*' + re_pos_year,
- re_pos_title + ur'\s*' + re_pos_volume_par + ur'\s*' + re_pos_page,
+ re_pos_title + re_sep_opt + re_pos_year + re_sep + re_pos_volume + re_sep + re_pos_page,
+ re_pos_title + re_sep + re_pos_volume + re_sep_opt + re_pos_year + re_sep_opt + re_pos_page,
+ re_pos_title + re_sep + re_pos_volume + re_sep + re_pos_page + re_sep_opt + re_pos_year,
+ re_pos_title + re_sep_opt + re_pos_volume_par + re_sep_opt + re_pos_page,
]
re_opts = re.VERBOSE | re.UNICODE | re.IGNORECASE
def compute_pos_patterns(patterns):
return [re.compile(p, re_opts) for p in patterns]
re_pos = compute_pos_patterns(re_pos_patterns)
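# Examples (illustrative) of PoS references matched by the patterns above:
#   "PoS (LAT2007) 369"        (title, parenthesised volume, page)
#   "PoS LAT2007 (2007) 369"   (title, volume, year, page)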
# Pattern for arxiv numbers
-re_arxiv = re.compile(ur""" # arxiv 9910-1234v9 [physics.ins-det]
+# arxiv 9910-1234v9 [physics.ins-det]
+re_arxiv = re.compile(ur"""
ARXIV[\s:-]*(?P<year>\d{2})-?(?P<month>\d{2})
[\s.-]*(?P<num>\d{4})(?:[\s-]*V(?P<version>\d))?
\s*(?P<suffix>\[[A-Z.-]+\])? """, re.VERBOSE | re.UNICODE | re.IGNORECASE)
+# Pattern for arxiv numbers catchup
+# arxiv:9910-123 [physics.ins-det]
+RE_ARXIV_CATCHUP = re.compile(ur"""
+ ARXIV[\s:-]*(?P<year>\d{2})-?(?P<month>\d{2})
+ [\s.-]*(?P<num>\d{3})
+ \s*\[(?P<suffix>[A-Z.-]+)\]""", re.VERBOSE | re.UNICODE | re.IGNORECASE)
+
+# Patterns for ATLAS CONF report numbers
+RE_ATLAS_CONF_PRE_2010 = re.compile(
+ ur'(?<!\w:)ATL(AS)?-CONF-(?P<code>(?:200\d|99)-\d{3})(?![\w\d])')
+RE_ATLAS_CONF_POST_2010 = re.compile(
+ ur'(?<!\w:)ATL(AS)?-CONF-(?P<code>20[1-9]\d-\d{3})(?![\w\d])')
+
+
# Pattern for old arxiv numbers
-old_arxiv_numbers = ur"(?P<num>/\d{7})"
+old_arxiv_numbers = ur"[\|/:\s-]?(?P<num>(?:9[1-9]|0[0-7])(?:0[1-9]|1[0-2])\d{3})(?:v\d{1,3})?(?=[^\w\d]|$)"
+
old_arxiv = {
ur"acc-ph": None,
ur"astro-ph": None,
ur"astro-phy": "astro-ph",
ur"astro-ph\.[a-z]{2}": None,
ur"atom-ph": None,
ur"chao-dyn": None,
ur"chem-ph": None,
ur"cond-mat": None,
ur"cs": None,
ur"cs\.[a-z]{2}": None,
ur"gr-qc": None,
ur"hep-ex": None,
ur"hep-lat": None,
ur"hep-ph": None,
ur"hepph": "hep-ph",
ur"hep-th": None,
ur"hepth": "hep-th",
ur"math": None,
ur"math\.[a-z]{2}": None,
ur"math-ph": None,
ur"nlin": None,
ur"nlin\.[a-z]{2}": None,
ur"nucl-ex": None,
ur"nucl-th": None,
ur"physics": None,
ur"physics\.acc-ph": None,
ur"physics\.ao-ph": None,
ur"physics\.atm-clus": None,
ur"physics\.atom-ph": None,
ur"physics\.bio-ph": None,
ur"physics\.chem-ph": None,
ur"physics\.class-ph": None,
ur"physics\.comp-ph": None,
ur"physics\.data-an": None,
ur"physics\.ed-ph": None,
ur"physics\.flu-dyn": None,
ur"physics\.gen-ph": None,
ur"physics\.geo-ph": None,
ur"physics\.hist-ph": None,
ur"physics\.ins-det": None,
ur"physics\.med-ph": None,
ur"physics\.optics": None,
ur"physics\.plasm-ph": None,
ur"physics\.pop-ph": None,
ur"physics\.soc-ph": None,
ur"physics\.space-ph": None,
- ur"plasm-ph": "physics\.plasm-ph",
+ ur"plasm-ph": "physics.plasm-ph",
ur"q-bio\.[a-z]{2}": None,
ur"q-fin\.[a-z]{2}": None,
ur"q-alg": None,
ur"quant-ph": None,
ur"quant-phys": "quant-ph",
ur"solv-int": None,
ur"stat\.[a-z]{2}": None,
ur"stat-mech": None,
+ ur"dg-ga": None,
+ ur"hap-ph": "hep-ph",
+ ur"funct-an": None,
+ ur"quantph": "quant-ph",
+ ur"stro-ph": "astro-ph",
+ ur"hepex": "hep-ex",
+ ur"math-ag": "math.ag",
+ ur"math-dg": "math.dg",
+ ur"nuc-th": "nucl-th",
+ ur"math-ca": "math.ca",
+ ur"nlin-si": "nlin.si",
+ ur"quantum-ph": "quant-ph",
+ ur"ep-ph": "hep-ph",
+ ur"ep-th": "hep-ph",
+ ur"ep-ex": "hep-ex",
+ ur"hept-h": "hep-th",
+ ur"hepp-h": "hep-ph",
+ ur"physi-cs": "physics",
+ ur"asstro-ph": "astro-ph",
+ ur"hep-lt": "hep-lat",
+ ur"he-ph": "hep-ph",
+ ur"het-ph": "hep-ph",
+ ur"mat-ph": "math.th",
+ ur"math-th": "math.th",
+ ur"ucl-th": "nucl-th",
+ ur"nnucl-th": "nucl-th",
+ ur"nuclt-th": "nucl-th",
+ ur"atro-ph": "astro-ph",
+ ur"qnant-ph": "quant-ph",
+ ur"astr-ph": "astro-ph",
+ ur"math-qa": "math.qa",
+ ur"tro-ph": "astro-ph",
+ ur"hucl-th": "nucl-th",
+ ur"math-gt": "math.gt",
+ ur"math-nt": "math.nt",
+ ur"math-ct": "math.ct",
+ ur"math-oa": "math.oa",
+ ur"math-sg": "math.sg",
+ ur"math-ap": "math.ap",
+ ur"quan-ph": "quant-ph",
+ ur"nlin-cd": "nlin.cd",
+ ur"math-sp": "math.sp",
+ ur"atro-ph": "astro-ph",
+ ur"ast-ph": "astro-ph",
+ ur"asyro-ph": "astro-ph",
+ ur"aastro-ph": "astro-ph",
+ ur"astrop-ph": "astro-ph",
+ ur"arxivastrop-ph": "astro-ph",
+ ur"hept-th": "hep-th",
+ ur"quan-th": "quant-th",
+ ur"asro-ph": "astro-ph",
+ ur"castro-ph": "astro-ph",
+ ur"asaastro-ph": "astro-ph",
+ ur"hhep-ph": "hep-ph",
+ ur"hhep-ex": "hep-ex",
+ ur"alg-geom": None,
+ ur"nuclth": "nucl-th",
}
def compute_arxiv_re(report_pattern, report_number):
if report_number is None:
report_number = ur"\g<name>"
- report_re = re.compile("(?P<name>" + report_pattern + ")" \
+ report_re = re.compile(ur"(?<!<cds\.REPORTNUMBER>)(?<!\w)" \
+ + "(?P<name>" + report_pattern + ")" \
+ old_arxiv_numbers, re.U|re.I)
return report_re, report_number
RE_OLD_ARXIV = [compute_arxiv_re(*i) for i in old_arxiv.iteritems()]
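# Example (illustrative): each RE_OLD_ARXIV entry pairs a compiled pattern
# with the standardised category name, so that e.g. "hep-ph/9812375" is
# recognised with num = "9812375", and the typo form "hepth/9901001v2"
# is normalised to the "hep-th" series.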
def compute_years():
current_year = datetime.now().year
return '|'.join(str(y)[2:] for y in xrange(1991, current_year + 1))
arxiv_years = compute_years()
def compute_months():
return '|'.join(str(y).zfill(2) for y in xrange(1, 13))
arxiv_months = compute_months()
re_new_arxiv = re.compile(ur""" # 9910.1234v9 [physics.ins-det]
(?<!ARXIV:)
(?P<year>%(arxiv_years)s)
(?P<month>%(arxiv_months)s)
\.(?P<num>\d{4})(?:[\s-]*V(?P<version>\d))?
\s*(?P<suffix>\[[A-Z.-]+\])? """ % {'arxiv_years': arxiv_years,
'arxiv_months': arxiv_months}, re.VERBOSE | re.UNICODE | re.IGNORECASE)
# Pattern to recognize quoted text:
re_quoted = re.compile(ur'"(?P<title>[^"]+)"', re.UNICODE)
# Pattern to recognise an ISBN for a book:
re_isbn = re.compile(ur"""
(?:ISBN[-– ]*(?:|10|13)|International Standard Book Number)
[:\s]*
(?P<code>[-\-–0-9Xx]{10,25})""", re.VERBOSE | re.UNICODE)
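# Example (illustrative):
#   re_isbn.search(u"ISBN-13: 978-0-321-34960-6").group('code')
#   ->  u'978-0-321-34960-6'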
# Pattern to recognise a correct knowledge base line:
re_kb_line = re.compile(ur'^\s*(?P<seek>[^\s].*)\s*---\s*(?P<repl>[^\s].*)\s*$',
re.UNICODE)
# precompile some often-used regexp for speed reasons:
re_regexp_character_class = re.compile(ur'\[[^\]]+\]', re.UNICODE)
re_multiple_hyphens = re.compile(ur'-{2,}', re.UNICODE)
# In certain papers, " bf " appears just before the volume of a
# cited item. It is believed that this is a mistyped TeX command for
# making the volume "bold" in the paper.
# The line may look something like this after numeration has been recognised:
# M. Bauer, B. Stech, M. Wirbel, Z. Phys. bf C : <cds.VOL>34</cds.VOL>
# <cds.YR>(1987)</cds.YR> <cds.PG>103</cds.PG>
# The " bf " stops the title from being correctly linked with its series
# and/or numeration and thus breaks the citation.
# The pattern below is used to identify this situation and remove the
# " bf" component:
re_identify_bf_before_vol = \
re.compile(ur' bf ((\w )?: \<cds\.VOL\>)', \
re.UNICODE)
# Patterns used for creating institutional preprint report-number
# recognition patterns (used by function "institute_num_pattern_to_regex"):
# Recognise any character that isn't a->z, A->Z, 0->9, /, [, ], ' ', '"':
re_report_num_chars_to_escape = \
re.compile(ur'([^\]A-Za-z0-9\/\[ "])', re.UNICODE)
# Replace "hello" with hello:
re_extract_quoted_text = (re.compile(ur'\"([^"]+)\"', re.UNICODE), ur'\g<1>',)
# Replace / [abcd ]/ with /( [abcd])?/ :
re_extract_char_class = (re.compile(ur' \[([^\]]+) \]', re.UNICODE), \
ur'( [\g<1>])?')
# URL recognition:
+raw_url_pattern = ur"""
+ (https?|s?ftp)://(?:[\w\d_.-])+(?::\d{1,5})?
+ (?:/[\w\d_.?=&%~∼-]+)*/?
+"""
# Stand-alone URL (e.g. http://invenio-software.org/ )
re_raw_url = \
- re.compile(ur"""\"?
- (
- (https?|s?ftp):\/\/([\w\d\_\.\-])+(:\d{1,5})?
- (\/\~([\w\d\_\.\-])+)?
- (\/([\w\d\_\.\-\?\=\&])+)*
- (\/([\w\d\_\-]+\.\w{1,6})?)?
- )
- \"?""", re.UNICODE|re.I|re.VERBOSE)
+ re.compile("['\"]?(?P<url>" + raw_url_pattern + ")['\"]?",
+ re.UNICODE|re.I|re.VERBOSE)
# HTML marked-up URL (e.g. <a href="http://invenio-software.org/">
# CERN Document Server Software Consortium</a> )
re_html_tagged_url = \
- re.compile(ur"""(\<a\s+href\s*=\s*([\'"])?
- (((https?|s?ftp):\/\/)?([\w\d\_\.\-])+(:\d{1,5})?
- (\/\~([\w\d\_\.\-])+)?(\/([\w\d\_\.\-\?\=])+)*
- (\/([\w\d\_\-]+\.\w{1,6})?)?)([\'"])?\>
- ([^\<]+)
- \<\/a\>)""", re.UNICODE|re.I|re.VERBOSE)
+ re.compile(ur"""
+ # Opening a tag
+ <a\s+
+ # href attribute
+ href\s*=\s*[\'"]
+ # href value
+ (?P<url>""" + raw_url_pattern + ur""")
+ # href closing quote
+ ['"]\s*>
+ # Tag content
+ (?P<desc>[^\<]+)
+ # Closing a tag
+ </a>""", re.UNICODE|re.I|re.VERBOSE)
# Numeration recognition pattern - used to identify numeration
# associated with a title when marking the title up into MARC XML:
vol_tag = ur'<cds\.VOL\>(?P<vol>[^<]+)<\/cds\.VOL>'
year_tag = ur'\<cds\.YR\>\((?P<yr>[^<]+)\)\<\/cds\.YR\>'
series_tag = ur'(?P<series>(?:[A-H]|I{1,3}V?|VI{0,3}))?'
page_tag = ur'\<cds\.PG\>(?P<pg>[^<]+)\<\/cds\.PG\>'
re_recognised_numeration_for_title_plus_series = re.compile(
ur'^\s*[\.,]?\s*(?:Ser\.\s*)?' + series_tag + ur'\s*:?\s*' + vol_tag +
u'\s*(?: ' + year_tag + u')?\s*(?: ' + page_tag + u')', re.UNICODE)
# Another numeration pattern. This one is designed to match marked-up
# numeration that is essentially an IBID, but without the word "IBID". E.g.:
# <cds.JOURNAL>J. Phys. A</cds.JOURNAL> : <cds.VOL>31</cds.VOL>
# <cds.YR>(1998)</cds.YR> <cds.PG>2391</cds.PG>; : <cds.VOL>32</cds.VOL>
# <cds.YR>(1999)</cds.YR> <cds.PG>6119</cds.PG>.
re_numeration_no_ibid_txt = \
re.compile(ur"""
^((\s*;\s*|\s+and\s+)(?P<series>(?:[A-H]|I{1,3}V?|VI{0,3}))?\s*:?\s ## Leading ; : or " and :", and a possible series letter
\<cds\.VOL\>(?P<vol>\d+|(?:\d+\-\d+))\<\/cds\.VOL>\s ## Volume
\<cds\.YR\>\((?P<yr>[12]\d{3})\)\<\/cds\.YR\>\s ## year
\<cds\.PG\>(?P<pg>[RL]?\d+[c]?)\<\/cds\.PG\>) ## page
""", re.UNICODE|re.VERBOSE)
re_title_followed_by_series_markup_tags = \
re.compile(ur'(\<cds.JOURNAL(?P<ibid>ibid)?\>([^\<]+)\<\/cds.JOURNAL(?:ibid)?\>\s*.?\s*\<cds\.SER\>([A-H]|(I{1,3}V?|VI{0,3}))\<\/cds\.SER\>)', re.UNICODE)
re_title_followed_by_implied_series = \
re.compile(ur'(\<cds.JOURNAL(?P<ibid>ibid)?\>([^\<]+)\<\/cds.JOURNAL(?:ibid)?\>\s*.?\s*([A-H]|(I{1,3}V?|VI{0,3}))\s+:)', re.UNICODE)
re_punctuation = re.compile(ur'[\.\,\;\'\(\)\-]', re.UNICODE)
# The following pattern is used to recognise "citation items" that have been
# identified in the line, when building a MARC XML representation of the line:
re_tagged_citation = re.compile(ur"""
\<cds\. ## open tag: <cds.
((?:JOURNAL(?P<ibid>ibid)?) ## a JOURNAL tag
|VOL ## or a VOL tag
|YR ## or a YR tag
|PG ## or a PG tag
|REPORTNUMBER ## or a REPORTNUMBER tag
|SER ## or a SER tag
|URL ## or a URL tag
|DOI ## or a DOI tag
|QUOTED ## or a QUOTED tag
|ISBN ## or a ISBN tag
|PUBLISHER ## or a PUBLISHER tag
+ |COLLABORATION ## or a COLLABORATION tag
|AUTH(stnd|etal|incl)) ## or an AUTH tag
(\s\/)? ## optional /
\> ## closing of tag (>)
""", re.UNICODE|re.VERBOSE)
# is there pre-recognised numeration-tagging within a
# few characters of the start of this part of the line?
re_tagged_numeration_near_line_start = \
re.compile(ur'^.{0,4}?<CDS (VOL|SER)>', re.UNICODE)
-re_ibid = \
- re.compile(ur'(-|\b)?((?:(?:IBID(?!EM))|(?:IBIDEM))\.?(\s([A-H]|(I{1,3}V?|VI{0,3})|[1-3]))?)\s?', \
- re.UNICODE)
-
-re_matched_ibid = re.compile(ur'(?:(?:IBID(?!EM))|(?:IBIDEM))(?:[\.,]{0,2}\s*|\s+)([A-H]|(I{1,3}V?|VI{0,3})|[1-3])?', \
- re.UNICODE)
+re_ibid = re.compile(ur'(-|\b)?IBID(EM)?\.?', re.UNICODE)
-re_series_from_numeration = re.compile(ur'^\.?,?\s+([A-H]|(I{1,3}V?|VI{0,3}))\s+:\s+', \
- re.UNICODE)
+re_series_from_numeration = re.compile(ur'^([A-Z])\s*[,\s:-]?\s*\d+', re.UNICODE)
+re_series_from_numeration_after_volume = re.compile(ur'^\d+\s*[,\s:-]?\s*([A-Z])', re.UNICODE)
# Obtain the series character from the standardised title text
# Only used when no series letter is obtained from numeration matching
re_series_from_title = re.compile(ur"""
([^\s].*)
(?:[\s\.]+(?:(?P<open_bracket>\()\s*[Ss][Ee][Rr]\.)?
([A-H]|(I{1,3}V?|VI{0,3}))
)?
(?(open_bracket)\s*\))$ ## Only match the ending bracket if the opening bracket was found""", \
re.UNICODE|re.VERBOSE)
re_wash_volume_tag = (
re.compile(ur'<cds\.VOL>(\w) (\d+)</cds\.VOL>'),
ur'<cds.VOL>\g<1>\g<2></cds.VOL>',
)
# Roman Numbers
re_roman_numbers = ur"[XxVvIi]+"
-# Sep
-re_sep = ur"\s*[,\s:-]\s*"
+# Possible beginnings of numeration
+re_start = ur"\s*[,\s:-]?\s*"
# Title tag
re_title_tag = ur"(?P<title_tag><cds\.JOURNAL>[^<]*<\/cds\.JOURNAL>)"
# Number (within a volume)
re_volume_sub_number = ur'[Nn][oO°]\.?\s*\d{1,6}'
re_volume_sub_number_opt = u'(?:' + re_sep + u'(?P<vol_sub>' + \
re_volume_sub_number + u'))?'
# Volume
re_volume_prefix = ur"(?:[Vv]o?l?\.?|[Nn][oO°]\.?)" # Optional Vol./No.
re_volume_suffix = ur"(?:\s*\(\d{1,2}(?:-\d)?\))?"
re_volume_num = ur"\d+|" + "(?:(?<!\w)" + re_roman_numbers + "(?!\w))"
-re_volume_id = ur"(?P<vol>(?:(?:[A-Za-z]\s?)?(?P<vol_num>%(volume_num)s))|(?:(?:%(volume_num)s)(?:[A-Za-z]))|(?:(?:[A-Za-z]\s?)?\d+\s*\-\s*(?:[A-Za-z]\s?)?\d+))" % {'volume_num': re_volume_num}
+re_volume_id = ur"(?P<vol>(?:(?:[A-Za-z]\s*[,\s:-]?\s*)?(?P<vol_num>%(volume_num)s))|(?:(?P<vol_num_alt>%(volume_num)s)(?:[A-Za-z]))|(?:(?:[A-Za-z]\s?)?(?P<vol_num_alt2>\d+)\s*\-\s*(?:[A-Za-z]\s?)?\d+))" % {'volume_num': re_volume_num}
re_volume_check = ur"(?<![\/\d])"
re_volume = ur"\b" + u"(?:" + re_volume_prefix + u")?\s*" + re_volume_check + \
re_volume_id + re_volume_suffix
# Month
re_short_month = ur"""(?:(?:
[Jj]an|[Ff]eb|[Mm]ar|[Aa]pr|[Mm]ay|[Jj]un|
[Jj]ul|[Aa]ug|[Ss]ep|[Oo]ct|[Nn]ov|[Dd]ec
)\.?)"""
re_month = ur"""(?:(?:
[Jj]anuary|[Ff]ebruary|[Mm]arch|[Aa]pril|[Mm]ay|[Jj]une|
[Jj]uly|[Aa]ugust|[Ss]eptember|[Oo]ctober|[Nn]ovember|[Dd]ecember
)\.?)"""
# Year
re_year_num = ur"(?:19|20)\d{2}"
re_year_text = u"(?P<year>[A-Za-z]?" + re_year_num + u")(?:[A-Za-z]?)"
re_year = ur"""
\(?
(?:%(short_month)s[,\s]\s*)? # Jul, 1980
(?:%(month)s[,\s]\s*)? # July, 1980
(?<!\d)
%(year)s
(?!\d)
\)?
""" % {
'year': re_year_text,
'short_month': re_short_month,
'month': re_month,
}
# Page
re_page_prefix = ur"[pP]?[p]?\.?\s?" # Starting page num: optional Pp.
re_page_num = ur"[RL]?\w?\d+[cC]?" # pagenum with optional R/L
-re_page_sep = ur"\s*[\s-]\s*" # optional separatr between pagenums
+re_page_sep = ur"\s*-\s*" # optional separator between pagenums
re_page = re_page_prefix + \
u"(?P<page>" + re_page_num + u")(?:" + re_page_sep + \
u"(?P<page_end>" + re_page_num + u"))?"
# Series
re_series = ur"(?P<series>[A-H])"
# Used for allowing 3(1991) without space
re_look_ahead_parentesis = ur"(?=\()"
re_sep_or_parentesis = u'(?:' + re_sep + u'|' + re_look_ahead_parentesis + ')'
re_look_behind_parentesis = ur"(?<=\))"
re_sep_or_after_parentesis = u'(?:' + \
re_sep + u'|' + re_look_behind_parentesis + ')'
# After having processed a line for titles, it may be possible to find more
# numeration with the aid of the recognised titles. The following 2 patterns
# are used for this:
-re_correct_numeration_2nd_try_ptn1 = (re.compile(
+re_correct_numeration_2nd_try_ptn1 = re.compile(
re_year + re_sep + # Year
- re_title_tag + re_sep + # Recognised, tagged title
+ re_title_tag + # Recognised, tagged title
+ u'(?P<aftertitle>' +
+ re_sep +
re_volume + re_sep + # The volume
- re_page, # The page
- re.UNICODE|re.VERBOSE), ur'\g<title_tag> : <cds.VOL>\g<vol></cds.VOL> ' \
- ur'<cds.YR>(\g<year>)</cds.YR> <cds.PG>\g<page></cds.PG>')
+ re_page + # The page
+ u')', re.UNICODE|re.VERBOSE)
-re_correct_numeration_2nd_try_ptn2 = (re.compile(
+re_correct_numeration_2nd_try_ptn2 = re.compile(
re_year + re_sep +
- re_title_tag + re_sep +
+ re_title_tag +
+ u'(?P<aftertitle>' +
+ re_sep +
re_volume + re_sep +
re_series + re_sep +
- re_page, re.UNICODE|re.VERBOSE),
- ur'\g<title_tag> <cds.SER>\g<series></cds.SER> : ' \
- ur'<cds.VOL>\g<vol></cds.VOL> ' \
- ur'<cds.YR>(\g<year>)</cds.YR> ' \
- ur'<cds.PG>\g<page></cds.PG>')
-
-re_correct_numeration_2nd_try_ptn3 = (re.compile(
- re_title_tag + re_sep + # Recognised, tagged title
+ re_page +
+ u')', re.UNICODE|re.VERBOSE)
+
+re_correct_numeration_2nd_try_ptn3 = re.compile(
+ re_title_tag +
+ u'(?P<aftertitle>' +
+ re_sep + # Recognised, tagged title
re_volume + re_sep + # The volume
- re_page, # The page
- re.UNICODE|re.VERBOSE), ur'\g<title_tag> <cds.VOL>\g<vol></cds.VOL> ' \
- ur'<cds.PG>\g<page></cds.PG>')
+ re_page + # The page
+ u')', re.UNICODE|re.VERBOSE)
-re_correct_numeration_2nd_try_ptn4 = (re.compile(
- re_title_tag + re_sep + # Recognised, tagged title
+re_correct_numeration_2nd_try_ptn4 = re.compile(
+ re_title_tag +
+ u'(?P<aftertitle>' +
+ re_sep + # Recognised, tagged title
re_year + ur"\s*[.,\s:]\s*" + # Year
re_volume + re_sep + # The volume
- re_page, # The page
- re.UNICODE|re.VERBOSE), ur'\g<title_tag> : <cds.VOL>\g<vol></cds.VOL> ' \
- ur'<cds.YR>(\g<year>)</cds.YR> <cds.PG>\g<page></cds.PG>')
+ re_page + # The page
+ u')', re.UNICODE|re.VERBOSE)
-re_correct_numeration_2nd_try_ptn5 = (re.compile(
- re_title_tag + re_sep + re_volume, re.UNICODE|re.VERBOSE),
- ur'\g<title_tag> <cds.VOL>\g<vol></cds.VOL>')
-
## precompile some regexps used to search for and standardize
## numeration patterns in a line for the first time:
## Delete the colon and expressions such as Serie, vol, V. inside the pattern
## <serie : volume> E.g. Replace the string """Series A, Vol 4""" with """A 4"""
re_strip_series_and_volume_labels = (re.compile(
ur'(Serie\s|\bS\.?\s)?([A-H])\s?[:,]\s?(\b[Vv]o?l?\.?|\b[Nn]o\.?)?\s?(\d+)', re.UNICODE),
ur'\g<2> \g<4>')
## This pattern is not compiled, but rather included in
## the other numeration paterns:
re_nucphysb_subtitle = \
ur'(?:[\(\[]\s*(?:[Ff][Ss]|[Pp][Mm])\s*\d{0,4}\s*[\)\]])'
re_nucphysb_subtitle_opt = \
u'(?:' + re_sep + re_nucphysb_subtitle + u')?'
## the 4 main numeration patterns:
## Pattern 1: <vol, page, year>
## <v, p, y>
-re_numeration_vol_page_yr = (re.compile(
+re_numeration_vol_page_yr = re.compile(
+ re_start +
re_volume + re_volume_sub_number_opt + re_sep +
re_page + re_sep_or_parentesis +
- re_year, re.UNICODE|re.VERBOSE), ur' : <cds.VOL>\g<vol></cds.VOL> ' \
- ur'<cds.YR>(\g<year>)</cds.YR> ' \
- ur'<cds.PG>\g<page></cds.PG> ')
+ re_year, re.UNICODE|re.VERBOSE)
## <v, [FS], p, y>
-re_numeration_vol_nucphys_page_yr = (re.compile(
+re_numeration_vol_nucphys_page_yr = re.compile(
+ re_start +
re_volume + re_volume_sub_number_opt + re_sep +
re_nucphysb_subtitle + re_sep +
re_page + re_sep_or_parentesis +
- re_year, re.UNICODE|re.VERBOSE), ur' : <cds.VOL>\g<vol></cds.VOL> ' \
- ur'<cds.YR>(\g<year>)</cds.YR> ' \
- ur'<cds.PG>\g<page></cds.PG> ')
+ re_year, re.UNICODE|re.VERBOSE)
## <[FS], v, p, y>
-re_numeration_nucphys_vol_page_yr = (re.compile(
+re_numeration_nucphys_vol_page_yr = re.compile(
+ re_start +
re_nucphysb_subtitle + re_sep +
re_volume + re_sep +
re_page + re_sep_or_parentesis +
- re_year, re.UNICODE|re.VERBOSE), ur' : <cds.VOL>\g<vol></cds.VOL> ' \
- ur'<cds.YR>(\g<year>)</cds.YR> ' \
- ur'<cds.PG>\g<page></cds.PG> ')
+ re_year, re.UNICODE|re.VERBOSE)
## Pattern 2: <vol, year, page>
## <v, y, p>
-re_numeration_vol_yr_page = (re.compile(
+re_numeration_vol_yr_page = re.compile(
+ re_start +
re_volume + re_sep_or_parentesis +
re_year + re_sep_or_after_parentesis +
- re_page, re.UNICODE|re.VERBOSE), ur' : <cds.VOL>\g<vol></cds.VOL> ' \
- ur'<cds.YR>(\g<year>)</cds.YR> ' \
- ur'<cds.PG>\g<page></cds.PG> ')
+ re_page, re.UNICODE|re.VERBOSE)
## <v, sv, [FS]?, y, p>
-re_numeration_vol_subvol_nucphys_yr_page = (re.compile(
+re_numeration_vol_subvol_nucphys_yr_page = re.compile(
+ re_start +
re_volume + re_volume_sub_number_opt +
re_nucphysb_subtitle_opt + re_sep_or_parentesis +
re_year + re_sep_or_after_parentesis +
- re_page, re.UNICODE|re.VERBOSE), ur' : <cds.VOL>\g<vol></cds.VOL> ' \
- ur'<cds.YR>(\g<year>)</cds.YR> ' \
- ur'<cds.PG>\g<page></cds.PG> ')
+ re_page, re.UNICODE|re.VERBOSE)
## <v, [FS]?, y, sv, p>
-re_numeration_vol_nucphys_yr_subvol_page = (re.compile(
+re_numeration_vol_nucphys_yr_subvol_page = re.compile(
+ re_start +
re_volume + re_nucphysb_subtitle_opt +
re_sep_or_parentesis +
re_year + re_volume_sub_number_opt + re_sep +
- re_page, re.UNICODE|re.VERBOSE), ur' : <cds.VOL>\g<vol></cds.VOL> ' \
- ur'<cds.YR>(\g<year>)</cds.YR> ' \
- ur'<cds.PG>\g<page></cds.PG> ')
+ re_page, re.UNICODE|re.VERBOSE)
## <[FS]?, v, y, p>
-re_numeration_nucphys_vol_yr_page = (re.compile(
+re_numeration_nucphys_vol_yr_page = re.compile(
+ re_start +
re_nucphysb_subtitle + re_sep +
re_volume + re_sep_or_parentesis + # The volume (optional "vol"/"no")
re_year + re_sep_or_after_parentesis + # Year
- re_page, re.UNICODE|re.VERBOSE), ur' : <cds.VOL>\g<vol></cds.VOL> ' \
- ur'<cds.YR>(\g<year>)</cds.YR> ' \
- ur'<cds.PG>\g<page></cds.PG> ')
+ re_page, re.UNICODE|re.VERBOSE)
## Pattern 3: <vol, serie, year, page>
## <v, s, [FS]?, y, p>
# re_numeration_vol_series_nucphys_yr_page = (re.compile(
# re_volume + re_sep +
# re_series + re_sep +
# _sre_non_compiled_pattern_nucphysb_subtitle + re_sep_or_parentesis +
# re_year + re_sep +
# re_page, re.UNICODE|re.VERBOSE), ur' \g<series> : ' \
# ur'<cds.VOL>\g<vol></cds.VOL> ' \
# ur'<cds.YR>(\g<year>)</cds.YR> ' \
# ur'<cds.PG>\g<page></cds.PG> ')
## <v, [FS]?, s, y, p
-re_numeration_vol_nucphys_series_yr_page = (re.compile(
+re_numeration_vol_nucphys_series_yr_page = re.compile(
+ re_start +
re_volume + re_nucphysb_subtitle_opt + re_sep +
re_series + re_sep_or_parentesis +
re_year + re_sep_or_after_parentesis +
- re_page, re.UNICODE|re.VERBOSE), ur' \g<series> : ' \
- ur'<cds.VOL>\g<vol></cds.VOL> ' \
- ur'<cds.YR>(\g<year>)</cds.YR> ' \
- ur'<cds.PG>\g<page></cds.PG> ')
+ re_page, re.UNICODE|re.VERBOSE)
## Pattern 4: <vol, serie, page, year>
## <v, s, [FS]?, p, y>
-re_numeration_vol_series_nucphys_page_yr = (re.compile(
+re_numeration_vol_series_nucphys_page_yr = re.compile(
+ re_start +
re_volume + re_sep +
re_series + re_nucphysb_subtitle_opt + re_sep +
re_page + re_sep +
- re_year, re.UNICODE|re.VERBOSE), ur' \g<series> : ' \
- ur'<cds.VOL>\g<vol></cds.VOL> ' \
- ur'<cds.YR>(\g<year>)</cds.YR> ' \
- ur'<cds.PG>\g<page></cds.PG> ')
+ re_year, re.UNICODE|re.VERBOSE)
## <v, [FS]?, s, p, y>
-re_numeration_vol_nucphys_series_page_yr = (re.compile(
- re_volume + re_nucphysb_subtitle_opt + re_sep +
- re_series + re_sep +
- re_page + re_sep +
- re_year, re.UNICODE|re.VERBOSE), ur' \g<series> : ' \
- ur'<cds.VOL>\g<vol></cds.VOL> ' \
- ur'<cds.YR>(\g<year>)</cds.YR> ' \
- ur'<cds.PG>\g<page></cds.PG> ')
+re_numeration_vol_nucphys_series_page_yr = re.compile(
+ re_start +
+ re_volume + re_nucphysb_subtitle_opt + re_sep +
+ re_series + re_sep +
+ re_page + re_sep +
+ re_year, re.UNICODE|re.VERBOSE)
## Pattern 5: <year, vol, page>
-re_numeration_yr_vol_page = (re.compile(
- re_year + re_sep_or_after_parentesis +
- re_volume + re_sep +
- re_page, re.UNICODE|re.VERBOSE), ur' : <cds.VOL>\g<vol></cds.VOL> ' \
- ur'<cds.YR>(\g<year>)</cds.YR> ' \
- ur'<cds.PG>\g<page></cds.PG> ')
+re_numeration_yr_vol_page = re.compile(
+ re_start +
+ re_year + re_sep_or_after_parentesis +
+ re_volume + re_sep +
+ re_page, re.UNICODE|re.VERBOSE)
## Pattern used to locate references of a doi inside a citation
## This pattern matches both url (http) and 'doi:' or 'DOI' formats
re_doi = (re.compile(ur"""
- ((\(?[Dd][Oo][Ii](\s)*\)?:?(\s)*) #'doi:' or 'doi' or '(doi)' (upper or lower case)
- |(https?:\/\/dx\.doi\.org\/))? #or 'http://dx.doi.org/' (neither has to be present)
- (10\. #10. (mandatory for DOI's)
- \d{4} #[0-9] x4
- \/ #/
- [\w\-_;\(\)\/\.]+ #any character
- [\w\-_;\(\)\/]) #any character excluding a full stop
+ ((\(?[Dd][Oo][Ii](\s)*\)?:?(\s)*) # 'doi:' or 'doi' or '(doi)' (upper or lower case)
+ |(https?://dx\.doi\.org\/))? # or 'http://dx.doi.org/' (neither has to be present)
+ (10\. # 10. (mandatory for DOI's)
+ \d{4} # [0-9] x4
+ / # /
+ [\w\-_:;\(\)/\.<>]+ # any character
+ [\w\-_:;\(\)/<>]) # any character excluding a full stop
""", re.VERBOSE))
def _create_regex_pattern_add_optional_spaces_to_word_characters(word):
"""Add the regex special characters (\s*) to allow optional spaces between
the characters in a word.
@param word: (string) the word to be inserted into a regex pattern.
@return: string: the regex pattern for that word with optional spaces
between all of its characters.
"""
new_word = u""
for ch in word:
if ch.isspace():
new_word += ch
else:
new_word += ch + ur'\s*'
return new_word
def get_reference_section_title_patterns():
"""Return a list of compiled regex patterns used to search for the title of
a reference section in a full-text document.
@return: (list) of compiled regex patterns.
"""
patterns = []
titles = [u'references',
u'references.',
u'r\u00C9f\u00E9rences',
u'r\u00C9f\u00C9rences',
u'reference',
u'refs',
u'r\u00E9f\u00E9rence',
u'r\u00C9f\u00C9rence',
u'r\xb4ef\xb4erences',
u'r\u00E9fs',
u'r\u00C9fs',
u'bibliography',
u'bibliographie',
u'citations',
u'literaturverzeichnis']
sect_marker = u'^\s*([\[\-\{\(])?\s*' \
u'((\w|\d){1,5}([\.\-\,](\w|\d){1,5})?\s*' \
u'[\.\-\}\)\]]\s*)?' \
u'(?P<title>'
sect_marker1 = u'^(\d){1,3}\s*(?P<title>'
line_end = ur'(\s*s\s*e\s*c\s*t\s*i\s*o\s*n\s*)?)([\)\}\]])?' \
ur'($|\s*[\[\{\(\<]\s*[1a-z]\s*[\}\)\>\]]|\:$)'
for t in titles:
t_ptn = re.compile(sect_marker + \
_create_regex_pattern_add_optional_spaces_to_word_characters(t) + \
line_end, re.I|re.UNICODE)
patterns.append(t_ptn)
## allow e.g. 'N References' to be found where N is an integer
t_ptn = re.compile(sect_marker1 + \
_create_regex_pattern_add_optional_spaces_to_word_characters(t) + \
line_end, re.I|re.UNICODE)
patterns.append(t_ptn)
return patterns
def get_reference_line_numeration_marker_patterns(prefix=u''):
"""Return a list of compiled regex patterns used to search for the marker
of a reference line in a full-text document.
@param prefix: (string) the possible prefix to a reference line
@return: (list) of compiled regex patterns.
"""
title = u""
if type(prefix) in (str, unicode):
title = prefix
g_name = u'(?P<mark>'
g_close = u')'
- space = ur'^\s*'
+ space = ur'\s*'
patterns = [
+ # [1]
space + title + g_name + ur'\[\s*(?P<marknum>\d+)\s*\]' + g_close,
- space + title + g_name + ur'\[\s*[a-zA-Z]+\+?\s?(\d{1,4}[A-Za-z]?)?\s*\]' + g_close,
+ # [letters and numbers]
+ space + title + g_name + ur'\[\s*[a-zA-Z:-]+\+?\s?(\d{1,4}[A-Za-z:-]?)?\s*\]' + g_close,
+ # {1}
space + title + g_name + ur'\{\s*(?P<marknum>\d+)\s*\}' + g_close,
+ # <1>
space + title + g_name + ur'\<\s*(?P<marknum>\d+)\s*\>' + g_close,
+ # (1)
space + title + g_name + ur'\(\s*(?P<marknum>\d+)\s*\)' + g_close,
+ # 1.
space + title + g_name + ur'(?P<marknum>\d+)\s*\.(?!\d)' + g_close,
+ # 1
space + title + g_name + ur'(?P<marknum>\d+)\s+' + g_close,
+ # 1]
space + title + g_name + ur'(?P<marknum>\d+)\s*\]' + g_close,
+ # 1}
space + title + g_name + ur'(?P<marknum>\d+)\s*\}' + g_close,
+ # 1)
space + title + g_name + ur'(?P<marknum>\d+)\s*\)' + g_close,
+ # 1>
space + title + g_name + ur'(?P<marknum>\d+)\s*\>' + g_close,
+ # [1.1]
+ space + title + g_name + ur'\[\s*\d+\.\d+\s*\]' + g_close,
+ # [ ]
space + title + g_name + ur'\[\s*\]' + g_close,
+ # *
space + title + g_name + ur'\*' + g_close,
]
return [re.compile(p, re.I|re.UNICODE) for p in patterns]
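# Example (illustrative): the first returned pattern recognises plain
# bracketed numeric markers:
#   ptns = get_reference_line_numeration_marker_patterns()
#   m = ptns[0].match(u'[1] S. Weinberg, ...')
#   m.group('mark')    ->  u'[1]'
#   m.group('marknum') ->  u'1'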
def get_reference_line_marker_pattern(pattern):
"""Return a list of compiled regex patterns used to search for the first
reference line in a full-text document.
The line is considered to start with either: [1] or {1}
The line is considered to start with : 1. or 2. or 3. etc
The line is considered to start with : 1 or 2 etc (just a number)
@return: (list) of compiled regex patterns.
"""
return re.compile(u'(?P<mark>' + pattern + u')', re.I|re.UNICODE)
re_reference_line_bracket_markers = get_reference_line_marker_pattern(
ur'(?P<left>\[)\s*(?P<marknum>\d+)\s*(?P<right>\])'
)
re_reference_line_curly_bracket_markers = get_reference_line_marker_pattern(
ur'(?P<left>\{)\s*(?P<marknum>\d+)\s*(?P<right>\})'
)
re_reference_line_dot_markers = get_reference_line_marker_pattern(
ur'(?P<left>)\s*(?P<marknum>\d+)\s*(?P<right>\.)'
)
re_reference_line_number_markers = get_reference_line_marker_pattern(
ur'(?P<left>)\s*(?P<marknum>\d+)\s*(?P<right>)'
)
def get_post_reference_section_title_patterns():
"""Return a list of compiled regex patterns used to search for the title
of the section after the reference section in a full-text document.
@return: (list) of compiled regex patterns.
"""
compiled_patterns = []
thead = ur'^\s*([\{\(\<\[]?\s*(\w|\d)\s*[\)\}\>\.\-\]]?\s*)?'
ttail = ur'(\s*\:\s*)?'
numatn = ur'(\d+|\w\b|i{1,3}v?|vi{0,3})[\.\,]{0,2}\b'
roman_numbers = ur'[LVIX]'
patterns = [
# Section titles
thead + _create_regex_pattern_add_optional_spaces_to_word_characters(u'appendix') + ttail,
thead + _create_regex_pattern_add_optional_spaces_to_word_characters(u'appendices') + ttail,
thead + _create_regex_pattern_add_optional_spaces_to_word_characters(u'acknowledgement') + ur's?' + ttail,
thead + _create_regex_pattern_add_optional_spaces_to_word_characters(u'acknowledgment') + ur's?' + ttail,
thead + _create_regex_pattern_add_optional_spaces_to_word_characters(u'table') + ur'\w?s?\d?' + ttail,
thead + _create_regex_pattern_add_optional_spaces_to_word_characters(u'figure') + ur's?' + ttail,
thead + _create_regex_pattern_add_optional_spaces_to_word_characters(u'list of figure') + ur's?' + ttail,
thead + _create_regex_pattern_add_optional_spaces_to_word_characters(u'annex') + ur's?' + ttail,
thead + _create_regex_pattern_add_optional_spaces_to_word_characters(u'discussion') + ur's?' + ttail,
thead + _create_regex_pattern_add_optional_spaces_to_word_characters(u'remercie') + ur's?' + ttail,
thead + _create_regex_pattern_add_optional_spaces_to_word_characters(u'index') + ur's?' + ttail,
thead + _create_regex_pattern_add_optional_spaces_to_word_characters(u'summary') + ur's?' + ttail,
# Figure nums
ur'^\s*' + _create_regex_pattern_add_optional_spaces_to_word_characters(u'figure') + numatn,
ur'^\s*' + _create_regex_pattern_add_optional_spaces_to_word_characters(u'fig') + ur'\.\s*' + numatn,
ur'^\s*' + _create_regex_pattern_add_optional_spaces_to_word_characters(u'fig') + ur'\.?\s*\d\w?\b',
# Tables
ur'^\s*' + _create_regex_pattern_add_optional_spaces_to_word_characters(u'table') + numatn,
ur'^\s*' + _create_regex_pattern_add_optional_spaces_to_word_characters(u'tab') + ur'\.\s*' + numatn,
ur'^\s*' + _create_regex_pattern_add_optional_spaces_to_word_characters(u'tab') + ur'\.?\s*\d\w?\b',
# Other titles formats
ur'^\s*' + roman_numbers + ur'\.?\s*[Cc]onclusion[\w\s]*$',
+ ur'^\s*Appendix\s[A-Z]\s*\:\s*[a-zA-Z]+\s*',
]
for p in patterns:
compiled_patterns.append(re.compile(p, re.I|re.UNICODE))
return compiled_patterns
def get_post_reference_section_keyword_patterns():
"""Return a list of compiled regex patterns used to search for various
keywords that can often be found after, and therefore suggest the end of,
a reference section in a full-text document.
@return: (list) of compiled regex patterns.
"""
compiled_patterns = []
patterns = [u'(' + _create_regex_pattern_add_optional_spaces_to_word_characters(u'prepared') + \
ur'|' + _create_regex_pattern_add_optional_spaces_to_word_characters(u'created') + \
ur').*(AAS\s*)?\sLATEX',
ur'AAS\s+?LATEX\s+?' + _create_regex_pattern_add_optional_spaces_to_word_characters(u'macros') + u'v',
ur'^\s*' + _create_regex_pattern_add_optional_spaces_to_word_characters(u'This paper has been produced using'),
ur'^\s*' + \
_create_regex_pattern_add_optional_spaces_to_word_characters(u'This article was processed by the author using Springer-Verlag') + \
u' LATEX']
for p in patterns:
compiled_patterns.append(re.compile(p, re.I|re.UNICODE))
return compiled_patterns
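# For illustration (hypothetical input line): these patterns flag the
# typesetting boilerplate that often trails a reference section, e.g.
#
#   >>> kws = get_post_reference_section_keyword_patterns()
#   >>> any(p.match(u'This paper has been produced using the LATEX macros')
#   ...     for p in kws)
#   True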
def regex_match_list(line, patterns):
"""Given a list of COMPILED regex patters, perform the "re.match" operation
on the line for every pattern.
Break from searching at the first match, returning the match object.
In the case that no patterns match, the None type will be returned.
@param line: (unicode string) to be searched in.
@param patterns: (list) of compiled regex patterns to search "line"
with.
@return: (None or an re.match object), depending upon whether one of
the patterns matched within line or not.
"""
m = None
for ptn in patterns:
m = ptn.match(line)
if m is not None:
break
return m
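# Sketch of regex_match_list in use (inputs assumed for the example): the
# first pattern that matches wins, and later patterns are never tried.
#
#   >>> patterns = get_post_reference_section_title_patterns()
#   >>> regex_match_list(u'Appendix A: Details', patterns) is not None
#   True
#   >>> regex_match_list(u'[1] J. Maldacena, hep-th/9711200', patterns) is None
#   True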
# The different forms of arXiv notation
re_arxiv_notation = re.compile(ur"""
(arxiv)|(e[\-\s]?print:?\s*arxiv)
""", re.VERBOSE)
# "et al." appearing before "J." means that J is a journal
re_num = re.compile(ur'(\d+)')
diff --git a/modules/docextract/lib/refextract_record.py b/modules/docextract/lib/refextract_record.py
new file mode 100644
index 000000000..5ec5daebc
--- /dev/null
+++ b/modules/docextract/lib/refextract_record.py
@@ -0,0 +1,257 @@
+# -*- coding: utf-8 -*-
+##
+## This file is part of Invenio.
+## Copyright (C) 2013 CERN.
+##
+## Invenio is free software; you can redistribute it and/or
+## modify it under the terms of the GNU General Public License as
+## published by the Free Software Foundation; either version 2 of the
+## License, or (at your option) any later version.
+##
+## Invenio is distributed in the hope that it will be useful, but
+## WITHOUT ANY WARRANTY; without even the implied warranty of
+## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+## General Public License for more details.
+##
+## You should have received a copy of the GNU General Public License
+## along with Invenio; if not, write to the Free Software Foundation, Inc.,
+## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
+
+from datetime import datetime
+
+from invenio.docextract_record import BibRecord, \
+ BibRecordField
+from invenio.refextract_config import \
+ CFG_REFEXTRACT_FIELDS, \
+ CFG_REFEXTRACT_IND1_REFERENCE, \
+ CFG_REFEXTRACT_IND2_REFERENCE, \
+ CFG_REFEXTRACT_TAG_ID_EXTRACTION_STATS, \
+ CFG_REFEXTRACT_SUBFIELD_EXTRACTION_STATS, \
+ CFG_REFEXTRACT_SUBFIELD_EXTRACTION_TIME, \
+ CFG_REFEXTRACT_SUBFIELD_EXTRACTION_VERSION, \
+ CFG_REFEXTRACT_VERSION
+
+from invenio import config
+CFG_INSPIRE_SITE = getattr(config, 'CFG_INSPIRE_SITE', False)
+
+
+def format_marker(line_marker):
+ return line_marker.strip("[](){}. ")
+
+
+def build_record(counts, fields, recid=None, status_code=0):
+ """Given a series of MARC XML-ized reference lines and a record-id, write a
+ MARC XML record to the stdout stream. Include in the record some stats
+ for the extraction job.
+ The printed MARC XML record will essentially take the following
+ structure:
+ <record>
+ <controlfield tag="001">1</controlfield>
+ <datafield tag="999" ind1="C" ind2="5">
+ [...]
+ </datafield>
+ [...]
+ <datafield tag="999" ind1="C" ind2="6">
+ <subfield code="a">
+ Invenio/X.XX.X refextract/X.XX.X-timestamp-err-repnum-title-URL-misc
+ </subfield>
+ </datafield>
+ </record>
+    Timestamp, error code, and the per-category counts take the relevant
+    values.
+
+    @param counts: (dict) - the number of citations found per category
+        (report numbers, titles, author groups, URLs, DOIs, misc text)
+        in the document's reference lines.
+    @param fields: (list) of BibRecordField objects. Each field holds a
+        single MARC 999C5 datafield, making up one reference line. These
+        fields make up the record body.
+    @param recid: the record-id of the given document (put into the
+        001 field).
+    @param status_code: (integer) the status of reference-extraction for
+        the given record: was there an error or not? 0 = no error;
+        1 = error.
+    @return: (BibRecord) the built record, including extraction statistics.
+ """
+ record = BibRecord(recid=recid)
+ record['999'] = fields
+ field = record.add_field(CFG_REFEXTRACT_TAG_ID_EXTRACTION_STATS)
+ stats_str = "%(status)s-%(reportnum)s-%(title)s-%(author)s-%(url)s-%(doi)s-%(misc)s" % {
+ 'status' : status_code,
+ 'reportnum' : counts['reportnum'],
+ 'title' : counts['title'],
+ 'author' : counts['auth_group'],
+ 'url' : counts['url'],
+ 'doi' : counts['doi'],
+ 'misc' : counts['misc'],
+ }
+ field.add_subfield(CFG_REFEXTRACT_SUBFIELD_EXTRACTION_STATS,
+ stats_str)
+ field.add_subfield(CFG_REFEXTRACT_SUBFIELD_EXTRACTION_TIME,
+ datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
+ field.add_subfield(CFG_REFEXTRACT_SUBFIELD_EXTRACTION_VERSION,
+ CFG_REFEXTRACT_VERSION)
+
+ return record
+
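+# A usage sketch, for illustration only (the counts keys mirror those used
+# in stats_str above; 'fields' would come from build_references() below):
+#
+#   counts = {'reportnum': 2, 'title': 1, 'auth_group': 1,
+#             'url': 0, 'doi': 0, 'misc': 3}
+#   record = build_record(counts, fields, recid=1, status_code=0)
+#   print record.to_xml()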
+
+def build_references(citations):
+ """Build marc xml from a references list
+
+ Transform the reference elements into marc xml
+ """
+    # build_reference_fields() takes as input:
+    # 1. A list of lists of dictionaries, where each dictionary is a piece
+    #    of citation information corresponding to a tag in the citation.
+    # 2. The line marker for this entire citation line (multiple citation
+    #    'finds' inside a single citation will use the same marker value).
+    # The resulting fields are a properly marked up form of the citation.
+    # They take into account authors to try and split up references
+    # which should be read as two SEPARATE ones.
+ return [c for citation_elements in citations
+ for elements in citation_elements['elements']
+ for c in build_reference_fields(elements,
+ citation_elements['line_marker'])]
+
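+# Illustrative input shape (hypothetical values; the keys are the ones
+# consumed above and in build_reference_fields() below):
+#
+#   citations = [{'line_marker': u'[1]',
+#                 'elements': [[{'type': 'JOURNAL', 'misc_txt': u'',
+#                                'title': u'Phys.Rev.Lett.', 'volume': u'19',
+#                                'page': u'1264', 'year': u'1967'}]]}]
+#   fields = build_references(citations)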
+
+def add_subfield(field, code, value):
+ return field.add_subfield(CFG_REFEXTRACT_FIELDS[code], value)
+
+
+def add_journal_subfield(field, element, inspire_format):
+ if inspire_format:
+ value = '%(title)s,%(volume)s,%(page)s' % element
+ else:
+ value = '%(title)s %(volume)s (%(year)s) %(page)s' % element
+
+ return add_subfield(field, 'journal', value)
+
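+# For illustration (hypothetical element values): an element such as
+#   {'title': 'Phys.Rev.Lett.', 'volume': '19', 'page': '1264', 'year': '1967'}
+# is rendered as 'Phys.Rev.Lett.,19,1264' in the INSPIRE format and as
+# 'Phys.Rev.Lett. 19 (1967) 1264' otherwise.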
+
+def create_reference_field(line_marker):
+ field = BibRecordField(ind1=CFG_REFEXTRACT_IND1_REFERENCE,
+ ind2=CFG_REFEXTRACT_IND2_REFERENCE)
+ if line_marker.strip("., [](){}"):
+ add_subfield(field, 'linemarker', format_marker(line_marker))
+ return field
+
+
+def build_reference_fields(citation_elements, line_marker, inspire_format=None):
+ """ Create the MARC-XML string of the found reference information which
+ was taken from a tagged reference line.
+ @param citation_elements: (list) an ordered list of dictionary elements,
+ with each element corresponding to a found
+ piece of information from a reference line.
+ @param line_marker: (string) The line marker for this single reference
+ line (e.g. [19])
+ @return xml_line: (string) The MARC-XML representation of the list of
+ reference elements
+ """
+ if inspire_format is None:
+ inspire_format = CFG_INSPIRE_SITE
+
+ ## Begin the datafield element
+ current_field = create_reference_field(line_marker)
+
+ reference_fields = [current_field]
+
+    ## This will hold the ordering of tags which have been appended to the field.
+    ## This list will be used to control the decisions involving the creation of new
+    ## citation lines (in the event of a new set of authors being recognised, or
+    ## strange title ordering...)
+ line_elements = []
+
+ for element in citation_elements:
+        ## Before going on to check 'what' the next element is, handle misc text and semi-colons
+ ## Multiple misc text subfields will be compressed later
+ ## This will also be the only part of the code that deals with MISC tag_typed elements
+ misc_txt = element['misc_txt']
+ if misc_txt.strip("., [](){}"):
+ misc_txt = misc_txt.lstrip('])} ,.').rstrip('[({ ,.')
+ add_subfield(current_field, 'misc', misc_txt)
+
+ # Now handle the type dependent actions
+ # JOURNAL
+ if element['type'] == "JOURNAL":
+ add_journal_subfield(current_field, element, inspire_format)
+ line_elements.append(element)
+
+ # REPORT NUMBER
+ elif element['type'] == "REPORTNUMBER":
+ add_subfield(current_field, 'reportnumber', element['report_num'])
+ line_elements.append(element)
+
+ # URL
+ elif element['type'] == "URL":
+ if element['url_string'] == element['url_desc']:
+ # Build the datafield for the URL segment of the reference line:
+ add_subfield(current_field, 'url', element['url_string'])
+            # If the URL string and its description differ, include them both
+ else:
+ add_subfield(current_field, 'url', element['url_string'])
+ add_subfield(current_field, 'urldesc', element['url_desc'])
+ line_elements.append(element)
+
+ # DOI
+ elif element['type'] == "DOI":
+ add_subfield(current_field, 'doi', element['doi_string'])
+ line_elements.append(element)
+
+ # AUTHOR
+ elif element['type'] == "AUTH":
+ value = element['auth_txt']
+ if element['auth_type'] == 'incl':
+ value = "(%s)" % value
+
+ add_subfield(current_field, 'author', value)
+ line_elements.append(element)
+
+ elif element['type'] == "QUOTED":
+ add_subfield(current_field, 'title', element['title'])
+ line_elements.append(element)
+
+ elif element['type'] == "ISBN":
+ add_subfield(current_field, 'isbn', element['ISBN'])
+ line_elements.append(element)
+
+ elif element['type'] == "BOOK":
+ add_subfield(current_field, 'title', element['title'])
+ line_elements.append(element)
+
+ elif element['type'] == "PUBLISHER":
+ add_subfield(current_field, 'publisher', element['publisher'])
+ line_elements.append(element)
+
+ elif element['type'] == "YEAR":
+ add_subfield(current_field, 'year', element['year'])
+ line_elements.append(element)
+
+ elif element['type'] == "COLLABORATION":
+ add_subfield(current_field,
+ 'collaboration',
+ element['collaboration'])
+ line_elements.append(element)
+
+ elif element['type'] == "RECID":
+ add_subfield(current_field, 'recid', str(element['recid']))
+ line_elements.append(element)
+
+ for field in reference_fields:
+ merge_misc(field)
+
+ return reference_fields
+
+
+def merge_misc(field):
+ current_misc = None
+ for subfield in field.subfields[:]:
+ if subfield.code == 'm':
+ if current_misc is None:
+ current_misc = subfield
+ else:
+ current_misc.value += " " + subfield.value
+ field.subfields.remove(subfield)
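+# Illustrative sketch (hypothetical values): merge_misc() collapses repeated
+# misc subfields in place, e.g.
+#
+#   field = create_reference_field(u'[1]')
+#   add_subfield(field, 'misc', u'some text')
+#   add_subfield(field, 'misc', u'more text')
+#   merge_misc(field)
+#   # field now carries a single $m subfield: u'some text more text'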
diff --git a/modules/docextract/lib/refextract_regression_tests.py b/modules/docextract/lib/refextract_regression_tests.py
index 312e731df..be2d7eede 100644
--- a/modules/docextract/lib/refextract_regression_tests.py
+++ b/modules/docextract/lib/refextract_regression_tests.py
@@ -1,2441 +1,2816 @@
# -*- coding: utf-8 -*-
##
## This file is part of Invenio.
## Copyright (C) 2010, 2011, 2013 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
"""
The Refextract regression test suite
The tests will not modify the database.
They are intended to make sure there is no regression in reference parsing.
"""
from invenio.testutils import InvenioTestCase
import re
-from invenio.testutils import make_test_suite, run_test_suite
-## Import the minimal necessary methods and variables needed to run Refextract
+from invenio.testutils import make_test_suite, run_test_suite, InvenioXmlTestCase
from invenio.refextract_engine import parse_references
from invenio.docextract_utils import setup_loggers
from invenio.refextract_text import wash_and_repair_reference_line
from invenio import refextract_kbs
-from invenio import refextract_xml
+from invenio import refextract_record
-def compare_references(test, references, expected_references, ignore_misc=True):
- out = references
-
- # Remove the ending statistical datafield from the final extracted references
- out = out[:out.find('<datafield tag="999" ind1="C" ind2="6">')].rstrip()
- out += "\n</record>"
+def compare_references(test, record, expected_references, ignore_misc=True):
+ # Remove the statistical datafield from the final extracted references
+ record['999'] = record.find_fields('999C5')
if ignore_misc:
# We don't care about what's in the misc field
- out = re.sub(' <subfield code="m">[^<]*</subfield>\n', '', out)
-
- if out != expected_references:
- print 'OUT'
- print out
+ for field in record['999']:
+ field.subfields = [subfield for subfield in field.subfields
+ if subfield.code != 'm']
- test.assertEqual(out, expected_references)
+ test.assertXmlEqual(record.to_xml(), expected_references.encode('utf-8'))
def _reference_test(test, ref_line, parsed_reference, ignore_misc=True):
#print u'refs: %s' % ref_line
ref_line = wash_and_repair_reference_line(ref_line)
#print u'cleaned: %s' % ref_line
out = parse_references([ref_line], kbs_files={
'journals' : test.kb_journals,
'journals-re' : test.kb_journals_re,
'report-numbers' : test.kb_report_numbers,
'books' : test.kb_books,
})
compare_references(test, out, parsed_reference, ignore_misc=ignore_misc)
-class RefextractInvenioTest(InvenioTestCase):
+class RefextractInvenioTest(InvenioXmlTestCase):
def setUp(self):
self.old_override = refextract_kbs.CFG_REFEXTRACT_KBS_OVERRIDE
refextract_kbs.CFG_REFEXTRACT_KBS_OVERRIDE = {}
- self.old_inspire = refextract_xml.CFG_INSPIRE_SITE
- refextract_xml.CFG_INSPIRE_SITE = False
+ self.old_inspire = refextract_record.CFG_INSPIRE_SITE
+ refextract_record.CFG_INSPIRE_SITE = False
setup_loggers(verbosity=0)
self.maxDiff = 2000
self.kb_journals = None
self.kb_journals_re = None
self.kb_report_numbers = None
self.kb_authors = None
self.kb_books = None
self.kb_conferences = None
def tearDown(self):
refextract_kbs.CFG_REFEXTRACT_KBS_OVERRIDE = self.old_override
- refextract_xml.CFG_INSPIRE_SITE = self.old_inspire
+ refextract_record.CFG_INSPIRE_SITE = self.old_inspire
def test_month_with_year(self):
ref_line = u"""[2] S. Weinberg, A Model of Leptons, Phys. Rev. Lett. 19 (Nov, 1967) 1264–1266."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">2</subfield>
<subfield code="h">S. Weinberg, A Model of Leptons</subfield>
<subfield code="s">Phys. Rev. Lett. 19 (1967) 1264</subfield>
<subfield code="y">1967</subfield>
</datafield>
</record>""")
def test_numeration_not_finding_year(self):
ref_line = u"""[137] M. Papakyriacou, H. Mayer, C. Pypen, H. P. Jr., and S. Stanzl-Tschegg, “Influence of loading frequency on high cycle fatigue properties of b.c.c. and h.c.p. metals,” Materials Science and Engineering, vol. A308, pp. 143–152, 2001."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">137</subfield>
<subfield code="h">M. Papakyriacou, H. Mayer, C. Pypen, H. P. Jr., and S. Stanzl-Tschegg</subfield>
<subfield code="t">Influence of loading frequency on high cycle fatigue properties of b.c.c. and h.c.p. metals</subfield>
<subfield code="s">Mat.Sci.Eng. A308 (2001) 143</subfield>
<subfield code="y">2001</subfield>
</datafield>
</record>""")
def test_numeration_not_finding_year2(self):
"""Bug fix test for numeration not finding year in this citation"""
ref_line = u"""[138] Y.-B. Park, R. Mnig, and C. A. Volkert, “Frequency effect on thermal fatigue damage in Cu interconnects,” Thin Solid Films, vol. 515, pp. 3253– 3258, 2007."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">138</subfield>
<subfield code="h">Y.-B. Park, R. Mnig, and C. A. Volkert</subfield>
<subfield code="t">Frequency effect on thermal fatigue damage in Cu interconnects</subfield>
<subfield code="s">Thin Solid Films 515 (2007) 3253</subfield>
<subfield code="y">2007</subfield>
</datafield>
</record>""")
def test_extra_a_in_report_number(self):
- ref_line = u"""[6] ATL-PHYS-INT-2009-110 Atlas"""
ref_line = u'[14] CMS Collaboration, CMS-PAS-HIG-12-002. CMS Collaboration, CMS-PAS-HIG-12-008. CMS Collaboration, CMS-PAS-HIG-12-022. ATLAS Collaboration, arXiv:1205.0701. ATLAS Collaboration, ATLAS-CONF-2012-078.'
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">14</subfield>
- <subfield code="h">(CMS Collaboration)</subfield>
+ <subfield code="c">CMS Collaboration</subfield>
<subfield code="r">CMS-PAS-HIG-12-002</subfield>
+ <subfield code="c">CMS Collaboration</subfield>
<subfield code="r">CMS-PAS-HIG-12-008</subfield>
+ <subfield code="c">CMS Collaboration</subfield>
<subfield code="r">CMS-PAS-HIG-12-022</subfield>
+ <subfield code="c">ATLAS Collaboration</subfield>
<subfield code="r">arXiv:1205.0701</subfield>
- <subfield code="r">ATL-CONF-2012-078</subfield>
+ <subfield code="c">ATLAS Collaboration</subfield>
+ <subfield code="r">ATLAS-CONF-2012-078</subfield>
</datafield>
</record>""")
-class RefextractTest(InvenioTestCase):
+class RefextractTest(InvenioXmlTestCase):
"""Testing output of refextract"""
def setUp(self):
- self.old_inspire = refextract_xml.CFG_INSPIRE_SITE
- refextract_xml.CFG_INSPIRE_SITE = True
+ self.old_inspire = refextract_record.CFG_INSPIRE_SITE
+ refextract_record.CFG_INSPIRE_SITE = True
self.inspire = True
self.kb_books = [
('Griffiths, David', 'Introduction to elementary particles', '2008')
]
self.kb_journals = [
("PHYSICAL REVIEW SPECIAL TOPICS ACCELERATORS AND BEAMS", "Phys.Rev.ST Accel.Beams"),
("PHYS REV D", "Phys.Rev.;D"),
("PHYS REV", "Phys.Rev."),
("PHYS REV LETT", "Phys.Rev.Lett."),
("PHYS LETT", "Phys.Lett."),
("J PHYS", "J.Phys."),
("JOURNAL OF PHYSICS", "J.Phys."),
("J PHYS G", "J.Phys.;G"),
("PHYSICAL REVIEW", "Phys.Rev."),
("ADV THEO MATH PHYS", "Adv.Theor.Math.Phys."),
("MATH PHYS", "Math.Phys."),
("J MATH PHYS", "J.Math.Phys."),
("JHEP", "JHEP"),
("SITZUNGSBER PREUSS AKAD WISS PHYS MATH KL", "Sitzungsber.Preuss.Akad.Wiss.Berlin (Math.Phys.)"),
("PHYS LETT", "Phys.Lett."),
("NUCL PHYS", "Nucl.Phys."),
("NUCL PHYS", "Nucl.Phys."),
("NUCL PHYS PROC SUPPL", "Nucl.Phys.Proc.Suppl."),
("JINST", "JINST"),
("THE EUROPEAN PHYSICAL JOURNAL C PARTICLES AND FIELDS", "Eur.Phys.J.;C"),
("COMMUN MATH PHYS", "Commun.Math.Phys."),
("COMM MATH PHYS", "Commun.Math.Phys."),
("REV MOD PHYS", "Rev.Mod.Phys."),
("ANN PHYS U S", "Ann.Phys."),
("AM J PHYS", "Am.J.Phys."),
("PROC R SOC LONDON SER", "Proc.Roy.Soc.Lond."),
("CLASS QUANT GRAVITY", "Class.Quant.Grav."),
("FOUND PHYS", "Found.Phys."),
("IEEE TRANS NUCL SCI", "IEEE Trans.Nucl.Sci."),
("SCIENCE", "Science"),
("ACTA MATERIALIA", "Acta Mater."),
("REVIEWS OF MODERN PHYSICS", "Rev.Mod.Phys."),
("NUCL INSTRUM METHODS", "Nucl.Instrum.Meth."),
("Z PHYS", "Z.Phys."),
+ ("Eur. Phys. J.", "Eur.Phys.J."),
]
self.kb_journals_re = [
"DAN---Dokl.Akad.Nauk Ser.Fiz.",
]
self.kb_report_numbers = [
"#####CERN#####",
"< yy 999>",
"< yyyy 999>",
- "ATL CONF---ATL-CONF",
"ATL PHYS INT---ATL-PHYS-INT",
- "ATLAS CONF---ATL-CONF",
- "#####LANL#####",
- "<s/syymm999>",
- "<syymm999>",
- "ASTRO PH---astro-ph",
- "HEP PH---hep-ph",
- "HEP TH---hep-th",
- "HEP EX---hep-ex",
"#####LHC#####",
"< yy 999>",
"<syyyy 999>",
"< 999>",
"< 9999>",
"CERN LHC PROJECT REPORT---CERN-LHC-Project-Report",
"CLIC NOTE ---CERN-CLIC-Note",
"CERN LHCC ---CERN-LHCC",
"CERN EP ---CERN-EP",
"######ATLANTIS#######",
"< 9999999>",
"CERN EX---CERN-EX",
]
setup_loggers(verbosity=0)
self.maxDiff = 2500
def tearDown(self):
- refextract_xml.CFG_INSPIRE_SITE = self.old_inspire
+ refextract_record.CFG_INSPIRE_SITE = self.old_inspire
def test_year_title_volume_page(self):
ref_line = u"[14] L. Randall and R. Sundrum, (1999) Phys. Rev. Lett. B83 S08004 More text"
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">14</subfield>
<subfield code="h">L. Randall and R. Sundrum</subfield>
<subfield code="s">Phys.Rev.Lett.,B83,S08004</subfield>
<subfield code="y">1999</subfield>
</datafield>
</record>""")
def test_url1(self):
ref_line = u"""[1] <a href="http://cdsweb.cern.ch/">CERN Document Server</a> J. Maldacena, Adv. Theor. Math. Phys. 2 (1998) 231, hep-th/9711200; http://cdsweb.cern.ch/ then http://www.itp.ucsb.edu/online/susyc99/discussion/. ; L. Susskind, J. Math. Phys. 36 (1995) 6377, hep-th/9409089; hello world a<a href="http://uk.yahoo.com/">Yahoo!</a>. Fin."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">1</subfield>
<subfield code="u">http://cdsweb.cern.ch/</subfield>
<subfield code="z">CERN Document Server</subfield>
<subfield code="h">J. Maldacena</subfield>
<subfield code="s">Adv.Theor.Math.Phys.,2,231</subfield>
<subfield code="r">hep-th/9711200</subfield>
<subfield code="y">1998</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">1</subfield>
<subfield code="u">http://cdsweb.cern.ch/</subfield>
<subfield code="u">http://www.itp.ucsb.edu/online/susyc99/discussion/</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">1</subfield>
<subfield code="h">L. Susskind</subfield>
<subfield code="s">J.Math.Phys.,36,6377</subfield>
<subfield code="r">hep-th/9409089</subfield>
<subfield code="y">1995</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">1</subfield>
<subfield code="u">http://uk.yahoo.com/</subfield>
<subfield code="z">Yahoo!</subfield>
</datafield>
</record>""")
def test_url2(self):
ref_line = u"""[2] J. Maldacena, Adv. Theor. Math. Phys. 2 (1998) 231; hep-th/9711200. http://cdsweb.cern.ch/"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">2</subfield>
<subfield code="h">J. Maldacena</subfield>
<subfield code="s">Adv.Theor.Math.Phys.,2,231</subfield>
<subfield code="y">1998</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">2</subfield>
<subfield code="r">hep-th/9711200</subfield>
<subfield code="u">http://cdsweb.cern.ch/</subfield>
</datafield>
</record>""")
def test_url3(self):
ref_line = u"3. “pUML Initial Submission to OMG’ s RFP for UML 2.0 Infrastructure”. URL http://www.cs.york.ac.uk/puml/"
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">3</subfield>
<subfield code="t">pUML Initial Submission to OMG\u2019 s RFP for UML 2.0 Infrastructure</subfield>
<subfield code="u">http://www.cs.york.ac.uk/puml/</subfield>
</datafield>
</record>""")
def test_url4(self):
ref_line = u"""[3] S. Gubser, I. Klebanov and A. Polyakov, Phys. Lett. B428 (1998) 105; hep-th/9802109. http://cdsweb.cern.ch/search.py?AGE=hello-world&ln=en"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">3</subfield>
<subfield code="h">S. Gubser, I. Klebanov and A. Polyakov</subfield>
<subfield code="s">Phys.Lett.,B428,105</subfield>
<subfield code="y">1998</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">3</subfield>
<subfield code="r">hep-th/9802109</subfield>
<subfield code="u">http://cdsweb.cern.ch/search.py?AGE=hello-world&amp;ln=en</subfield>
</datafield>
</record>""")
+ def test_url5(self):
+ ref_line = u"""[9] H. J. Drescher and Y. Nara, Phys. Rev. C 75, 034905 (2007); MC-KLN 3.46 at http://www.aiu.ac.jp/ynara/mckln/."""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">9</subfield>
+ <subfield code="h">H. J. Drescher and Y. Nara</subfield>
+ <subfield code="s">Phys.Rev.,C75,034905</subfield>
+ <subfield code="y">2007</subfield>
+ </datafield>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">9</subfield>
+ <subfield code="u">http://www.aiu.ac.jp/ynara/mckln/</subfield>
+ </datafield>
+</record>""")
+
def test_hep(self):
ref_line = u"""[5] O. Aharony, S. Gubser, J. Maldacena, H. Ooguri and Y. Oz, hep-th/9905111."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">5</subfield>
<subfield code="h">O. Aharony, S. Gubser, J. Maldacena, H. Ooguri and Y. Oz</subfield>
<subfield code="r">hep-th/9905111</subfield>
</datafield>
</record>""")
def test_hep2(self):
ref_line = u"""[4] E. Witten, Adv. Theor. Math. Phys. 2 (1998) 253; hep-th/9802150."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">4</subfield>
<subfield code="h">E. Witten</subfield>
<subfield code="s">Adv.Theor.Math.Phys.,2,253</subfield>
<subfield code="y">1998</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">4</subfield>
<subfield code="r">hep-th/9802150</subfield>
</datafield>
</record>""")
def test_hep3(self):
ref_line = u"""[6] L. Susskind, J. Math. Phys. 36 (1995) 6377; hep-th/9409089."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">6</subfield>
<subfield code="h">L. Susskind</subfield>
<subfield code="s">J.Math.Phys.,36,6377</subfield>
<subfield code="y">1995</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">6</subfield>
<subfield code="r">hep-th/9409089</subfield>
</datafield>
</record>""")
def test_hep4(self):
ref_line = u"""[7] L. Susskind and E. Witten, hep-th/9805114."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">7</subfield>
<subfield code="h">L. Susskind and E. Witten</subfield>
<subfield code="r">hep-th/9805114</subfield>
</datafield>
</record>""")
def test_double_hep_no_semi_colon(self):
ref_line = u"""[7] W. Fischler and L. Susskind, hep-th/9806039; N. Kaloper and A. Linde, Phys. Rev. D60 (1999) 105509, hep-th/9904120."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">7</subfield>
<subfield code="h">W. Fischler and L. Susskind</subfield>
<subfield code="r">hep-th/9806039</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">7</subfield>
<subfield code="h">N. Kaloper and A. Linde</subfield>
<subfield code="s">Phys.Rev.,D60,105509</subfield>
<subfield code="r">hep-th/9904120</subfield>
<subfield code="y">1999</subfield>
</datafield>
</record>""")
def test_journal_colon_sep(self):
ref_line = u"""[9] R. Bousso, JHEP 9906:028 (1999); hep-th/9906022."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">9</subfield>
<subfield code="h">R. Bousso</subfield>
<subfield code="s">JHEP,9906,028</subfield>
<subfield code="y">1999</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">9</subfield>
<subfield code="r">hep-th/9906022</subfield>
</datafield>
</record>""")
def test_book1(self):
"""book with authors and title but no quotes"""
ref_line = u"""[10] R. Penrose and W. Rindler, Spinors and Spacetime, volume 2, chapter 9 (Cambridge University Press, Cambridge, 1986)."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">10</subfield>
<subfield code="h">R. Penrose and W. Rindler</subfield>
</datafield>
</record>""")
def test_hep_combined(self):
ref_line = u"""[11] R. Britto-Pacumio, A. Strominger and A. Volovich, JHEP 9911:013 (1999); hep-th/9905210; blah hep-th/9905211; blah hep-ph/9711200"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">11</subfield>
<subfield code="h">R. Britto-Pacumio, A. Strominger and A. Volovich</subfield>
<subfield code="s">JHEP,9911,013</subfield>
<subfield code="y">1999</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">11</subfield>
<subfield code="r">hep-th/9905210</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">11</subfield>
<subfield code="r">hep-th/9905211</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">11</subfield>
<subfield code="r">hep-ph/9711200</subfield>
</datafield>
</record>""")
def test_misc5(self):
ref_line = u"""[12] V. Balasubramanian and P. Kraus, Commun. Math. Phys. 208 (1999) 413; hep-th/9902121."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">12</subfield>
<subfield code="h">V. Balasubramanian and P. Kraus</subfield>
<subfield code="s">Commun.Math.Phys.,208,413</subfield>
<subfield code="y">1999</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">12</subfield>
<subfield code="r">hep-th/9902121</subfield>
</datafield>
</record>""")
def test_misc6(self):
ref_line = u"""[13] V. Balasubramanian and P. Kraus, Phys. Rev. Lett. 83 (1999) 3605; hep-th/9903190."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">13</subfield>
<subfield code="h">V. Balasubramanian and P. Kraus</subfield>
<subfield code="s">Phys.Rev.Lett.,83,3605</subfield>
<subfield code="y">1999</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">13</subfield>
<subfield code="r">hep-th/9903190</subfield>
</datafield>
</record>""")
def test_hep5(self):
ref_line = u"""[14] P. Kraus, F. Larsen and R. Siebelink, hep-th/9906127."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">14</subfield>
<subfield code="h">P. Kraus, F. Larsen and R. Siebelink</subfield>
<subfield code="r">hep-th/9906127</subfield>
</datafield>
</record>""")
def test_report1(self):
ref_line = u"""[15] L. Randall and R. Sundrum, Phys. Rev. Lett. 83 (1999) 4690; hep-th/9906064. this is a test RN of a different type: CERN-LHC-Project-Report-2006. more text."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">15</subfield>
<subfield code="h">L. Randall and R. Sundrum</subfield>
<subfield code="s">Phys.Rev.Lett.,83,4690</subfield>
<subfield code="y">1999</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">15</subfield>
<subfield code="r">hep-th/9906064</subfield>
<subfield code="r">CERN-LHC-Project-Report-2006</subfield>
</datafield>
</record>""")
def test_hep6(self):
ref_line = u"""[16] S. Gubser, hep-th/9912001."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">16</subfield>
<subfield code="h">S. Gubser</subfield>
<subfield code="r">hep-th/9912001</subfield>
</datafield>
</record>""")
def test_triple_hep(self):
ref_line = u"""[17] H. Verlinde, hep-th/9906182; H. Verlinde, hep-th/9912018; J. de Boer, E. Verlinde and H. Verlinde, hep-th/9912012."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">17</subfield>
<subfield code="h">H. Verlinde</subfield>
<subfield code="r">hep-th/9906182</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">17</subfield>
<subfield code="h">H. Verlinde</subfield>
<subfield code="r">hep-th/9912018</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">17</subfield>
<subfield code="h">J. de Boer, E. Verlinde and H. Verlinde</subfield>
<subfield code="r">hep-th/9912012</subfield>
</datafield>
</record>""")
def test_url_no_tag(self):
ref_line = u"""[18] E. Witten, remarks at ITP Santa Barbara conference, "New dimensions in field theory and string theory": http://www.itp.ucsb.edu/online/susyc99/discussion/."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">18</subfield>
<subfield code="h">E. Witten</subfield>
<subfield code="t">New dimensions in field theory and string theory</subfield>
<subfield code="u">http://www.itp.ucsb.edu/online/susyc99/discussion/</subfield>
</datafield>
</record>""")
def test_journal_simple(self):
ref_line = u"""[19] D. Page and C. Pope, Commun. Math. Phys. 127 (1990) 529."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">19</subfield>
<subfield code="h">D. Page and C. Pope</subfield>
<subfield code="s">Commun.Math.Phys.,127,529</subfield>
<subfield code="y">1990</subfield>
</datafield>
</record>""")
def test_unknown_report(self):
ref_line = u"""[20] M. Duff, B. Nilsson and C. Pope, Physics Reports 130 (1986), chapter 9."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">20</subfield>
<subfield code="h">M. Duff, B. Nilsson and C. Pope</subfield>
</datafield>
</record>""")
def test_journal_volume_with_letter(self):
ref_line = u"""[21] D. Page, Phys. Lett. B79 (1978) 235."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">21</subfield>
<subfield code="h">D. Page</subfield>
<subfield code="s">Phys.Lett.,B79,235</subfield>
<subfield code="y">1978</subfield>
</datafield>
</record>""")
def test_journal_with_hep1(self):
ref_line = u"""[22] M. Cassidy and S. Hawking, Phys. Rev. D57 (1998) 2372, hep-th/9709066; S. Hawking, Phys. Rev. D52 (1995) 5681."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">22</subfield>
<subfield code="h">M. Cassidy and S. Hawking</subfield>
<subfield code="s">Phys.Rev.,D57,2372</subfield>
<subfield code="r">hep-th/9709066</subfield>
<subfield code="y">1998</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">22</subfield>
<subfield code="h">S. Hawking</subfield>
<subfield code="s">Phys.Rev.,D52,5681</subfield>
<subfield code="y">1995</subfield>
</datafield>
</record>""")
def test_hep7(self):
ref_line = u"""[23] K. Skenderis and S. Solodukhin, hep-th/9910023."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">23</subfield>
<subfield code="h">K. Skenderis and S. Solodukhin</subfield>
<subfield code="r">hep-th/9910023</subfield>
</datafield>
</record>""")
def test_journal_with_hep2(self):
ref_line = u"""[24] M. Henningson and K. Skenderis, JHEP 9807:023 (1998), hep-th/9806087."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">24</subfield>
<subfield code="h">M. Henningson and K. Skenderis</subfield>
<subfield code="s">JHEP,9807,023</subfield>
<subfield code="r">hep-th/9806087</subfield>
<subfield code="y">1998</subfield>
</datafield>
</record>""")
def test_unknown_book(self):
ref_line = u"""[25] C. Fefferman and C. Graham, "Conformal Invariants", in Elie Cartan et les Mathematiques d'aujourd'hui (Asterisque, 1985) 95."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">25</subfield>
<subfield code="h">C. Fefferman and C. Graham</subfield>
<subfield code="t">Conformal Invariants</subfield>
</datafield>
</record>""")
def test_hep8(self):
ref_line = u"""[27] E. Witten and S.-T. Yau, hep-th/9910245."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">27</subfield>
<subfield code="h">E. Witten and S.-T. Yau</subfield>
<subfield code="r">hep-th/9910245</subfield>
</datafield>
</record>""")
def test_hep9(self):
ref_line = u"""[28] R. Emparan, JHEP 9906:036 (1999); hep-th/9906040."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">28</subfield>
<subfield code="h">R. Emparan</subfield>
<subfield code="s">JHEP,9906,036</subfield>
<subfield code="y">1999</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">28</subfield>
<subfield code="r">hep-th/9906040</subfield>
</datafield>
</record>""")
def test_journal_with_hep3(self):
ref_line = u"""[29] A. Chamblin, R. Emparan, C. Johnson and R. Myers, Phys. Rev. D59 (1999) 64010, hep-th/9808177; S. Hawking, C. Hunter and D. Page, Phys. Rev. D59 (1998) 44033, hep-th/9809035."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">29</subfield>
<subfield code="h">A. Chamblin, R. Emparan, C. Johnson and R. Myers</subfield>
<subfield code="s">Phys.Rev.,D59,64010</subfield>
<subfield code="r">hep-th/9808177</subfield>
<subfield code="y">1999</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">29</subfield>
<subfield code="h">S. Hawking, C. Hunter and D. Page</subfield>
<subfield code="s">Phys.Rev.,D59,44033</subfield>
<subfield code="r">hep-th/9809035</subfield>
<subfield code="y">1998</subfield>
</datafield>
</record>""")
def test_journal_with_hep4(self):
ref_line = u"""[30] S. Sethi and L. Susskind, Phys. Lett. B400 (1997) 265, hep-th/9702101; T. Banks and N. Seiberg, Nucl. Phys. B497 (1997) 41, hep-th/9702187."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">30</subfield>
<subfield code="h">S. Sethi and L. Susskind</subfield>
<subfield code="s">Phys.Lett.,B400,265</subfield>
<subfield code="r">hep-th/9702101</subfield>
<subfield code="y">1997</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">30</subfield>
<subfield code="h">T. Banks and N. Seiberg</subfield>
<subfield code="s">Nucl.Phys.,B497,41</subfield>
<subfield code="r">hep-th/9702187</subfield>
<subfield code="y">1997</subfield>
</datafield>
</record>""")
def test_misc7(self):
ref_line = u"""[31] R. Emparan, C. Johnson and R. Myers, Phys. Rev. D60 (1999) 104001; hep-th/9903238."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">31</subfield>
<subfield code="h">R. Emparan, C. Johnson and R. Myers</subfield>
<subfield code="s">Phys.Rev.,D60,104001</subfield>
<subfield code="y">1999</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">31</subfield>
<subfield code="r">hep-th/9903238</subfield>
</datafield>
</record>""")
def test_misc8(self):
ref_line = u"""[32] S. Hawking, C. Hunter and M. Taylor-Robinson, Phys. Rev. D59 (1999) 064005; hep-th/9811056."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">32</subfield>
<subfield code="h">S. Hawking, C. Hunter and M. Taylor-Robinson</subfield>
<subfield code="s">Phys.Rev.,D59,064005</subfield>
<subfield code="y">1999</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">32</subfield>
<subfield code="r">hep-th/9811056</subfield>
</datafield>
</record>""")
def test_misc9(self):
ref_line = u"""[33] J. Dowker, Class. Quant. Grav. 16 (1999) 1937; hep-th/9812202."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">33</subfield>
<subfield code="h">J. Dowker</subfield>
<subfield code="s">Class.Quant.Grav.,16,1937</subfield>
<subfield code="y">1999</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">33</subfield>
<subfield code="r">hep-th/9812202</subfield>
</datafield>
</record>""")
def test_journal3(self):
ref_line = u"""[34] J. Brown and J. York, Phys. Rev. D47 (1993) 1407."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">34</subfield>
<subfield code="h">J. Brown and J. York</subfield>
<subfield code="s">Phys.Rev.,D47,1407</subfield>
<subfield code="y">1993</subfield>
</datafield>
</record>""")
def test_misc10(self):
ref_line = u"""[35] D. Freedman, S. Mathur, A. Matsuis and L. Rastelli, Nucl. Phys. B546 (1999) 96; hep-th/9804058. More text, followed by an IBID A 546 (1999) 96"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">35</subfield>
<subfield code="h">D. Freedman, S. Mathur, A. Matsuis and L. Rastelli</subfield>
<subfield code="s">Nucl.Phys.,B546,96</subfield>
<subfield code="y">1999</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">35</subfield>
<subfield code="r">hep-th/9804058</subfield>
<subfield code="h">D. Freedman, S. Mathur, A. Matsuis and L. Rastelli</subfield>
<subfield code="s">Nucl.Phys.,A546,96</subfield>
<subfield code="y">1999</subfield>
</datafield>
</record>""")
def test_misc11(self):
ref_line = u"""[36] D. Freedman, S. Mathur, A. Matsuis and L. Rastelli, Nucl. Phys. B546 (1999) 96; hep-th/9804058. More text, followed by an IBID A"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">36</subfield>
<subfield code="h">D. Freedman, S. Mathur, A. Matsuis and L. Rastelli</subfield>
<subfield code="s">Nucl.Phys.,B546,96</subfield>
<subfield code="y">1999</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">36</subfield>
<subfield code="r">hep-th/9804058</subfield>
</datafield>
</record>""")
def test_misc12(self):
ref_line = u"""[37] some misc lkjslkdjlksjflksj [hep-th/0703265] lkjlkjlkjlkj [hep-th/0606096], hep-ph/0002060, some more misc; Nucl. Phys. B546 (1999) 96"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">37</subfield>
<subfield code="r">hep-th/0703265</subfield>
+ <subfield code="0">93</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">37</subfield>
<subfield code="r">hep-th/0606096</subfield>
+ <subfield code="0">92</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">37</subfield>
<subfield code="r">hep-ph/0002060</subfield>
+ <subfield code="0">96</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">37</subfield>
<subfield code="s">Nucl.Phys.,B546,96</subfield>
<subfield code="y">1999</subfield>
</datafield>
</record>""")
def test_misc13(self):
ref_line = u"""[38] R. Emparan, C. Johnson and R.. Myers, Phys. Rev. D60 (1999) 104001; this is :: .... misc! hep-th/0703265. and some ...,.,.,.,::: more hep-th/0606096"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">38</subfield>
<subfield code="h">R. Emparan, C. Johnson and R.. Myers</subfield>
<subfield code="s">Phys.Rev.,D60,104001</subfield>
<subfield code="y">1999</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">38</subfield>
<subfield code="r">hep-th/0703265</subfield>
+ <subfield code="0">93</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">38</subfield>
<subfield code="r">hep-th/0606096</subfield>
+ <subfield code="0">92</subfield>
</datafield>
</record>""")
def test_misc14(self):
"""Same as test_misc12 but with unknow report numbers to the system"""
- ref_line = u"""[37] some misc lkjslkdjlksjflksj [hep-th/8703265] lkjlkjlkjlkj [hep-th/8606096], hep-ph/8002060, some more misc; Nucl. Phys. B546 (1999) 96"""
+ ref_line = u"""[37] some misc lkjslkdjlksjflksj [hep-th/9206059] lkjlkjlkjlkj [hep-th/9206060], hep-ph/9206061, some more misc; Nucl. Phys. B546 (1999) 96"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">37</subfield>
- <subfield code="r">hep-th/8703265</subfield>
- <subfield code="r">hep-th/8606096</subfield>
- <subfield code="r">hep-ph/8002060</subfield>
+ <subfield code="r">hep-th/9206059</subfield>
+ <subfield code="r">hep-th/9206060</subfield>
+ <subfield code="r">hep-ph/9206061</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">37</subfield>
<subfield code="s">Nucl.Phys.,B546,96</subfield>
<subfield code="y">1999</subfield>
</datafield>
</record>""")
def test_misc15(self):
"""Same as test_misc13 but with unknow report numbers to the system"""
- ref_line = u"""[38] R. Emparan, C. Johnson and R.. Myers, Phys. Rev. D60 (1999) 104001; this is :: .... misc! hep-th/8703265. and some ...,.,.,.,::: more hep-th/8606096"""
+ ref_line = u"""[38] R. Emparan, C. Johnson and R.. Myers, Phys. Rev. D60 (1999) 104001; this is :: .... misc! hep-th/9206059. and some ...,.,.,.,::: more hep-th/9206060"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">38</subfield>
<subfield code="h">R. Emparan, C. Johnson and R.. Myers</subfield>
<subfield code="s">Phys.Rev.,D60,104001</subfield>
<subfield code="y">1999</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">38</subfield>
- <subfield code="r">hep-th/8703265</subfield>
- <subfield code="r">hep-th/8606096</subfield>
+ <subfield code="r">hep-th/9206059</subfield>
+ <subfield code="r">hep-th/9206060</subfield>
</datafield>
</record>""")
def test_journal_with_hep5(self):
ref_line = u"""[39] A. Ceresole, G. Dall Agata and R. D Auria, JHEP 11(1999) 009, [hep-th/9907216]."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">39</subfield>
<subfield code="h">A. Ceresole, G. Dall Agata and R. D Auria</subfield>
<subfield code="s">JHEP,9911,009</subfield>
<subfield code="r">hep-th/9907216</subfield>
<subfield code="y">1999</subfield>
</datafield>
</record>""")
def test_journal_with_hep6(self):
ref_line = u"""[40] D.P. Jatkar and S. Randjbar-Daemi, Phys. Lett. B460, 281 (1999) [hep-th/9904187]."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">40</subfield>
<subfield code="h">D.P. Jatkar and S. Randjbar-Daemi</subfield>
<subfield code="s">Phys.Lett.,B460,281</subfield>
<subfield code="r">hep-th/9904187</subfield>
<subfield code="y">1999</subfield>
</datafield>
</record>""")
def test_journal_with_hep7(self):
ref_line = u"""[41] G. DallAgata, Phys. Lett. B460, (1999) 79, [hep-th/9904198]."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">41</subfield>
<subfield code="h">G. DallAgata</subfield>
<subfield code="s">Phys.Lett.,B460,79</subfield>
<subfield code="r">hep-th/9904198</subfield>
<subfield code="y">1999</subfield>
</datafield>
</record>""")
def test_journal_year_volume_page(self):
ref_line = u"""[43] Becchi C., Blasi A., Bonneau G., Collina R., Delduc F., Commun. Math. Phys., 1988, 120, 121."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">43</subfield>
<subfield code="h">Becchi C., Blasi A., Bonneau G., Collina R., Delduc F.</subfield>
<subfield code="s">Commun.Math.Phys.,120,121</subfield>
<subfield code="y">1988</subfield>
</datafield>
</record>""")
def test_journal_volume_year_page1(self):
ref_line = u"""[44]: N. Nekrasov, A. Schwarz, Instantons on noncommutative R4 and (2, 0) superconformal six-dimensional theory, Comm. Math. Phys., 198, (1998), 689-703."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">44</subfield>
<subfield code="h">N. Nekrasov, A. Schwarz</subfield>
<subfield code="s">Commun.Math.Phys.,198,689</subfield>
<subfield code="y">1998</subfield>
</datafield>
</record>""")
def test_journal_volume_year_page2(self):
ref_line = u"""[42] S.M. Donaldson, Instantons and Geometric Invariant Theory, Comm. Math. Phys., 93, (1984), 453-460."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">42</subfield>
<subfield code="h">S.M. Donaldson</subfield>
<subfield code="s">Commun.Math.Phys.,93,453</subfield>
<subfield code="y">1984</subfield>
</datafield>
</record>""")
def test_many_references_in_one_line(self):
ref_line = u"""[45] H. J. Bhabha, Rev. Mod. Phys. 17, 200(1945); ibid, 21, 451(1949); S. Weinberg, Phys. Rev. 133, B1318(1964); ibid, 134, 882(1964); D. L. Pursey, Ann. Phys(U. S)32, 157(1965); W. K. Tung, Phys, Rev. Lett. 16, 763(1966); Phys. Rev. 156, 1385(1967); W. J. Hurley, Phys. Rev. Lett. 29, 1475(1972)."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">45</subfield>
<subfield code="h">H. J. Bhabha</subfield>
<subfield code="s">Rev.Mod.Phys.,17,200</subfield>
<subfield code="y">1945</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">45</subfield>
<subfield code="h">H. J. Bhabha</subfield>
<subfield code="s">Rev.Mod.Phys.,21,451</subfield>
<subfield code="y">1949</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">45</subfield>
<subfield code="h">S. Weinberg</subfield>
<subfield code="s">Phys.Rev.,133,B1318</subfield>
<subfield code="y">1964</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">45</subfield>
<subfield code="h">S. Weinberg</subfield>
<subfield code="s">Phys.Rev.,134,882</subfield>
<subfield code="y">1964</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">45</subfield>
<subfield code="h">D. L. Pursey</subfield>
<subfield code="s">Ann.Phys.,32,157</subfield>
<subfield code="y">1965</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">45</subfield>
<subfield code="h">W. K. Tung</subfield>
<subfield code="s">Phys.Rev.Lett.,16,763</subfield>
<subfield code="y">1966</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">45</subfield>
<subfield code="s">Phys.Rev.,156,1385</subfield>
<subfield code="y">1967</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">45</subfield>
<subfield code="h">W. J. Hurley</subfield>
<subfield code="s">Phys.Rev.Lett.,29,1475</subfield>
<subfield code="y">1972</subfield>
</datafield>
</record>""")
def test_ibid(self):
+ """Simple ibid test"""
ref_line = u"""[46] E. Schrodinger, Sitzungsber. Preuss. Akad. Wiss. Phys. Math. Kl. 24, 418(1930); ibid, 3, 1(1931)"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">46</subfield>
<subfield code="h">E. Schrodinger</subfield>
<subfield code="s">Sitzungsber.Preuss.Akad.Wiss.Berlin (Math.Phys.),24,418</subfield>
<subfield code="y">1930</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">46</subfield>
<subfield code="h">E. Schrodinger</subfield>
<subfield code="s">Sitzungsber.Preuss.Akad.Wiss.Berlin (Math.Phys.),3,1</subfield>
<subfield code="y">1931</subfield>
</datafield>
</record>""")
+ def test_ibid2(self):
+ "Series has to be recognized for ibid to work properly"
+ ref_line = u"""[46] E. Schrodinger, J.Phys. G 24, 418 (1930); ibid, 3, 1(1931)"""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">46</subfield>
+ <subfield code="h">E. Schrodinger</subfield>
+ <subfield code="s">J.Phys.,G24,418</subfield>
+ <subfield code="y">1930</subfield>
+ </datafield>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">46</subfield>
+ <subfield code="h">E. Schrodinger</subfield>
+ <subfield code="s">J.Phys.,G3,1</subfield>
+ <subfield code="y">1931</subfield>
+ </datafield>
+</record>""")
+
+ def test_ibid3(self):
+ "Series after volume has to be recognized for ibid to work properly"
+ ref_line = u"""[46] E. Schrodinger, J.Phys. G 24, 418 (1930); ibid, 3, 1(1931)"""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">46</subfield>
+ <subfield code="h">E. Schrodinger</subfield>
+ <subfield code="s">J.Phys.,G24,418</subfield>
+ <subfield code="y">1930</subfield>
+ </datafield>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">46</subfield>
+ <subfield code="h">E. Schrodinger</subfield>
+ <subfield code="s">J.Phys.,G3,1</subfield>
+ <subfield code="y">1931</subfield>
+ </datafield>
+</record>""")
+
+ def test_ibid4(self):
+ "Series has to be recognized for ibid to work properly"
+ ref_line = u"""[46] E. Schrodinger, J.Phys. G 24, 418 (1930); ibid, A 3, 1(1931)"""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">46</subfield>
+ <subfield code="h">E. Schrodinger</subfield>
+ <subfield code="s">J.Phys.,G24,418</subfield>
+ <subfield code="y">1930</subfield>
+ </datafield>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">46</subfield>
+ <subfield code="h">E. Schrodinger</subfield>
+ <subfield code="s">J.Phys.,A3,1</subfield>
+ <subfield code="y">1931</subfield>
+ </datafield>
+</record>""")
+
+ def test_invalid_ibid(self):
+ "Ibid with no preceding journals, needs to go to misc text"
+ ref_line = u"""[46] E. Schrodinger, ibid, 3, 1(1931)"""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">46</subfield>
+ <subfield code="h">E. Schrodinger</subfield>
+ </datafield>
+</record>""")
+
def test_misc4(self):
ref_line = u"""[47] P. A. M. Dirac, Proc. R. Soc. London, Ser. A155, 447(1936); ibid, D24, 3333(1981)."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">47</subfield>
<subfield code="h">P. A. M. Dirac</subfield>
<subfield code="s">Proc.Roy.Soc.Lond.,A155,447</subfield>
<subfield code="y">1936</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">47</subfield>
<subfield code="h">P. A. M. Dirac</subfield>
<subfield code="s">Proc.Roy.Soc.Lond.,D24,3333</subfield>
<subfield code="y">1981</subfield>
</datafield>
</record>""")
def test_doi(self):
ref_line = u"""[48] O.O. Vaneeva, R.O. Popovych and C. Sophocleous, Enhanced Group Analysis and Exact Solutions of Vari-able Coefficient Semilinear Diffusion Equations with a Power Source, Acta Appl. Math., doi:10.1007/s10440-008-9280-9, 46 p., arXiv:0708.3457."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">48</subfield>
<subfield code="h">O.O. Vaneeva, R.O. Popovych and C. Sophocleous</subfield>
<subfield code="a">10.1007/s10440-008-9280-9</subfield>
<subfield code="r">arXiv:0708.3457</subfield>
</datafield>
</record>""")
+ def test_doi2(self):
+ ref_line = u"""[1] http://dx.doi.org/10.1175/1520-0442(2000)013<2671:TAORTT>2.0.CO;2"""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">1</subfield>
+ <subfield code="a">10.1175/1520-0442(2000)013&lt;2671:TAORTT&gt;2.0.CO;2</subfield>
+ </datafield>
+</record>""")
+
def test_misc3(self):
ref_line = u"""[49] M. I. Trofimov, N. De Filippis and E. A. Smolenskii. Application of the electronegativity indices of organic molecules to tasks of chemical informatics. Russ. Chem. Bull., 54:2235-2246, 2005. http://dx.doi.org/10.1007/s11172-006-0105-6."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">49</subfield>
<subfield code="h">M. I. Trofimov, N. De Filippis and E. A. Smolenskii</subfield>
<subfield code="a">10.1007/s11172-006-0105-6</subfield>
</datafield>
</record>""")
def test_misc2(self):
ref_line = u"""[50] M. Gell-Mann, P. Ramon ans R. Slansky, in Supergravity, P. van Niewenhuizen and D. Freedman (North-Holland 1979); T. Yanagida, in Proceedings of the Workshop on the Unified Thoery and the Baryon Number in teh Universe, ed. O. Sawaga and A. Sugamoto (Tsukuba 1979); R.N. Mohapatra and G. Senjanovic, Phys. Rev. Lett. 44, 912, (1980)."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">50</subfield>
- <subfield code="h">M. Gell-Mann, P. Ramon ans R. Slansky P. van Niewenhuizen and D. Freedman</subfield>
+ <subfield code="h">M. Gell-Mann, P. Ramon ans R. Slansky</subfield>
<subfield code="p">North-Holland</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">50</subfield>
- <subfield code="h">T. Yanagida (O. Sawaga and A. Sugamoto (eds.))</subfield>
+ <subfield code="h">T. Yanagida</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">50</subfield>
<subfield code="h">R.N. Mohapatra and G. Senjanovic</subfield>
<subfield code="s">Phys.Rev.Lett.,44,912</subfield>
<subfield code="y">1980</subfield>
</datafield>
</record>""")
def test_misc1(self):
ref_line = u"""[51] L.S. Durkin and P. Langacker, Phys. Lett B166, 436 (1986); Amaldi et al., Phys. Rev. D36, 1385 (1987); Hayward and Yellow et al., eds. Phys. Lett B245, 669 (1990); Nucl. Phys. B342, 15 (1990);"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">51</subfield>
<subfield code="h">L.S. Durkin and P. Langacker</subfield>
<subfield code="s">Phys.Lett.,B166,436</subfield>
<subfield code="y">1986</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">51</subfield>
<subfield code="h">Amaldi et al.</subfield>
<subfield code="s">Phys.Rev.,D36,1385</subfield>
<subfield code="y">1987</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">51</subfield>
<subfield code="h">(Hayward and Yellow et al. (eds.))</subfield>
<subfield code="s">Phys.Lett.,B245,669</subfield>
<subfield code="y">1990</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">51</subfield>
<subfield code="s">Nucl.Phys.,B342,15</subfield>
<subfield code="y">1990</subfield>
</datafield>
</record>""")
def test_combination_of_authors_names(self):
"""authors names in varied formats"""
ref_line = u"""[53] Hush, D.R., R.Leighton, and B.G. Horne, 1993. "Progress in supervised Neural Netw. What's new since Lippmann?" IEEE Signal Process. Magazine 10, 8-39"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">53</subfield>
<subfield code="h">Hush, D.R., R.Leighton, and B.G. Horne</subfield>
<subfield code="t">Progress in supervised Neural Netw. What's new since Lippmann?</subfield>
<subfield code="p">IEEE</subfield>
</datafield>
</record>""")
def test_two_initials_no_space(self):
ref_line = u"""[54] T.G. Rizzo, Phys. Rev. D40, 3035 (1989)"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">54</subfield>
<subfield code="h">T.G. Rizzo</subfield>
<subfield code="s">Phys.Rev.,D40,3035</subfield>
<subfield code="y">1989</subfield>
</datafield>
</record>""")
def test_surname_prefix_van(self):
"""An author with prefix + surname
e.g. van Niewenhuizen"""
ref_line = u"""[55] Hawking S., P. van Niewenhuizen, L.S. Durkin, D. Freeman, some title of some journal"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">55</subfield>
<subfield code="h">Hawking S., P. van Niewenhuizen, L.S. Durkin, D. Freeman</subfield>
</datafield>
</record>""")
def test_authors_coma_but_no_journal(self):
"""2 authors separated by coma"""
ref_line = u"""[56] Hawking S., D. Freeman, some title of some journal"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">56</subfield>
<subfield code="h">Hawking S., D. Freeman</subfield>
</datafield>
</record>""")
def test_authors_and_but_no_journal(self):
"""2 authors separated by "and" """
ref_line = u"""[57] Hawking S. and D. Freeman, another random title of some random journal"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">57</subfield>
<subfield code="h">Hawking S. and D. Freeman</subfield>
</datafield>
</record>""")
def test_simple_et_al(self):
"""author ending with et al."""
ref_line = u"""[1] Amaldi et al., Phys. Rev. D36, 1385 (1987)"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">1</subfield>
<subfield code="h">Amaldi et al.</subfield>
<subfield code="s">Phys.Rev.,D36,1385</subfield>
<subfield code="y">1987</subfield>
</datafield>
</record>""")
- def test_ibidem(self):
+ def test_ibid_two_journals(self):
"""IBIDEM test
ibidem must copy the previous reference journal and not
the first one
"""
ref_line = u"""[58] Nucl. Phys. B342, 15 (1990); Phys. Lett. B261, 146 (1991); ibidem B263, 459 (1991);"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">58</subfield>
<subfield code="s">Nucl.Phys.,B342,15</subfield>
<subfield code="y">1990</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">58</subfield>
<subfield code="s">Phys.Lett.,B261,146</subfield>
<subfield code="y">1991</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">58</subfield>
<subfield code="s">Phys.Lett.,B263,459</subfield>
<subfield code="y">1991</subfield>
</datafield>
</record>""")
def test_collaboration(self):
"""collaboration"""
ref_line = u"""[60] HERMES Collaboration, Airapetian A et al. 2005 Phys. Rev. D 71 012003 1-36"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">60</subfield>
- <subfield code="h">(HERMES Collaboration) Airapetian A et al.</subfield>
+ <subfield code="c">HERMES Collaboration</subfield>
+ <subfield code="h">Airapetian A et al.</subfield>
<subfield code="s">Phys.Rev.,D71,012003</subfield>
<subfield code="y">2005</subfield>
</datafield>
</record>""")
def test_weird_number_after_volume(self):
ref_line = u"""[61] de Florian D, Sassot R and Stratmann M 2007 Phys. Rev. D 75 114010 1-26"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">61</subfield>
<subfield code="h">de Florian D, Sassot R and Stratmann M</subfield>
<subfield code="s">Phys.Rev.,D75,114010</subfield>
<subfield code="y">2007</subfield>
</datafield>
</record>""")
def test_year_before_journal(self):
ref_line = u"""[64] Bourrely C, Soffer J and Buccella F 2002 Eur. Phys. J. C 23 487-501"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">64</subfield>
<subfield code="h">Bourrely C, Soffer J and Buccella F</subfield>
<subfield code="s">Eur.Phys.J.,C23,487</subfield>
<subfield code="y">2002</subfield>
</datafield>
</record>""")
def test_non_recognized_reference(self):
ref_line = u"""[63] Z. Guzik and R. Jacobsson, LHCb Readout Supervisor ’ODIN’ with a L1\nTrigger - Technical reference, Aug 2005, EDMS 704078-V1.0"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">63</subfield>
<subfield code="h">Z. Guzik and R. Jacobsson</subfield>
</datafield>
</record>""")
def test_year_stuck_to_volume(self):
ref_line = u"""[65] K. Huang, Am. J. Phys. 20, 479(1952)"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">65</subfield>
<subfield code="h">K. Huang</subfield>
<subfield code="s">Am.J.Phys.,20,479</subfield>
<subfield code="y">1952</subfield>
</datafield>
</record>""")
def test_two_initials_after_surname(self):
"""Author with 2 initials
e.g. Pate S. F."""
ref_line = u"""[62] Pate S. F., McKee D. W. and Papavassiliou V. 2008 Phys.Rev. C 78 448"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">62</subfield>
<subfield code="h">Pate S. F., McKee D. W. and Papavassiliou V.</subfield>
<subfield code="s">Phys.Rev.,C78,448</subfield>
<subfield code="y">2008</subfield>
</datafield>
</record>""")
def test_one_initial_after_surname(self):
"""Author with 1 initials
e.g. Pate S."""
ref_line = u"""[62] Pate S., McKee D., 2008 Phys.Rev. C 78 448"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">62</subfield>
<subfield code="h">Pate S., McKee D.</subfield>
<subfield code="s">Phys.Rev.,C78,448</subfield>
<subfield code="y">2008</subfield>
</datafield>
</record>""")
def test_two_initials_no_dot_after_surname(self):
"""Author with 2 initials
e.g. Pate S F"""
ref_line = u"""[62] Pate S F, McKee D W and Papavassiliou V 2008 Phys.Rev. C 78 448"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">62</subfield>
<subfield code="h">Pate S F, McKee D W and Papavassiliou V</subfield>
<subfield code="s">Phys.Rev.,C78,448</subfield>
<subfield code="y">2008</subfield>
</datafield>
</record>""")
def test_one_initial_no_dot_after_surname(self):
"""Author with 1 initials
e.g. Pate S"""
ref_line = u"""[62] Pate S, McKee D, 2008 Phys.Rev. C 78 448"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">62</subfield>
<subfield code="h">Pate S, McKee D</subfield>
<subfield code="s">Phys.Rev.,C78,448</subfield>
<subfield code="y">2008</subfield>
</datafield>
</record>""")
def test_two_initials_before_surname(self):
ref_line = u"""[67] G. A. Perkins, Found. Phys. 6, 237(1976)"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">67</subfield>
<subfield code="h">G. A. Perkins</subfield>
<subfield code="s">Found.Phys.,6,237</subfield>
<subfield code="y">1976</subfield>
</datafield>
</record>""")
def test_one_initial_before_surname(self):
ref_line = u"""[67] G. Perkins, Found. Phys. 6, 237(1976)"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">67</subfield>
<subfield code="h">G. Perkins</subfield>
<subfield code="s">Found.Phys.,6,237</subfield>
<subfield code="y">1976</subfield>
</datafield>
</record>""")
def test_two_initials_no_dot_before_surname(self):
ref_line = u"""[67] G A Perkins, Found. Phys. 6, 237(1976)"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">67</subfield>
<subfield code="h">G A Perkins</subfield>
<subfield code="s">Found.Phys.,6,237</subfield>
<subfield code="y">1976</subfield>
</datafield>
</record>""")
def test_one_initial_no_dot_before_surname(self):
ref_line = u"""[67] G Perkins, Found. Phys. 6, 237(1976)"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">67</subfield>
<subfield code="h">G Perkins</subfield>
<subfield code="s">Found.Phys.,6,237</subfield>
<subfield code="y">1976</subfield>
</datafield>
</record>""")
def test_ibid_twice(self):
ref_line = u"""[68] A. O. Barut et al, Phys. Rev. D23, 2454(1981); ibid, D24, 3333(1981); ibid, D31, 1386(1985)"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">68</subfield>
<subfield code="h">A. O. Barut et al.</subfield>
<subfield code="s">Phys.Rev.,D23,2454</subfield>
<subfield code="y">1981</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">68</subfield>
<subfield code="h">A. O. Barut et al.</subfield>
<subfield code="s">Phys.Rev.,D24,3333</subfield>
<subfield code="y">1981</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">68</subfield>
<subfield code="h">A. O. Barut et al.</subfield>
<subfield code="s">Phys.Rev.,D31,1386</subfield>
<subfield code="y">1985</subfield>
</datafield>
</record>""")
def test_no_authors(self):
ref_line = u"""[69] Phys. Rev. Lett. 52, 2009(1984)"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">69</subfield>
<subfield code="s">Phys.Rev.Lett.,52,2009</subfield>
<subfield code="y">1984</subfield>
</datafield>
</record>""")
def test_extra_01(self):
"Parsed erroniously as Phys.Rev.Lett.,101,01"
ref_line = u"""[17] de Florian D, Sassot R, Stratmann M and Vogelsang W 2008 Phys. Rev. Lett. 101 072001 1-4; 2009 Phys.
Rev. D 80 034030 1-25"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">17</subfield>
<subfield code="h">de Florian D, Sassot R, Stratmann M and Vogelsang W</subfield>
<subfield code="s">Phys.Rev.Lett.,101,072001</subfield>
<subfield code="y">2008</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">17</subfield>
<subfield code="s">Phys.Rev.,D80,034030</subfield>
<subfield code="y">2009</subfield>
</datafield>
</record>""")
def test_extra_no_after_vol(self):
ref_line = u"""[130] A. Kuper, H. Letaw, L. Slifkin, E-Sonder, and C. T. Tomizuka, “Self- diffusion in copper,” Physical Review, vol. 96, no. 5, pp. 1224–1225, 1954."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">130</subfield>
<subfield code="h">A. Kuper, H. Letaw, L. Slifkin, E-Sonder, and C. T. Tomizuka</subfield>
<subfield code="t">Self- diffusion in copper</subfield>
<subfield code="s">Phys.Rev.,96,1224</subfield>
<subfield code="y">1954</subfield>
</datafield>
</record>""")
def test_jinst(self):
ref_line = u"""[1] ATLAS Collaboration, G. Aad et al., The ATLAS Experiment at the CERN Large Hadron Collider, JINST 3 (2008) S08003."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">1</subfield>
- <subfield code="h">(ATLAS Collaboration) G. Aad et al.</subfield>
+ <subfield code="c">ATLAS Collaboration</subfield>
+ <subfield code="h">G. Aad et al.</subfield>
<subfield code="s">JINST,3,S08003</subfield>
<subfield code="y">2008</subfield>
</datafield>
</record>""")
def test_collaboration2(self):
ref_line = u"""[28] Particle Data Group Collaboration, K. Nakamura et al., Review of particle physics, J. Phys. G37 (2010) 075021."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">28</subfield>
- <subfield code="h">(Particle Data Group Collaboration) K. Nakamura et al.</subfield>
+ <subfield code="c">Particle Data Group Collaboration</subfield>
+ <subfield code="h">K. Nakamura et al.</subfield>
<subfield code="s">J.Phys.,G37,075021</subfield>
<subfield code="y">2010</subfield>
</datafield>
</record>""")
def test_sub_volume(self):
ref_line = u"""[8] S. Horvat, D. Khartchenko, O. Kortner, S. Kotov, H. Kroha, A. Manz, S. Mohrdieck-Mock, K. Nikolaev, R. Richter, W. Stiller, C. Valderanis, J. Dubbert, F. Rauscher, and A. Staude, Operation of the ATLAS muon drift-tube chambers at high background rates and in magnetic fields, IEEE Trans. Nucl. Sci. 53 (2006) no. 2, 562–566"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">8</subfield>
<subfield code="h">S. Horvat, D. Khartchenko, O. Kortner, S. Kotov, H. Kroha, A. Manz, S. Mohrdieck-Mock, K. Nikolaev, R. Richter, W. Stiller, C. Valderanis, J. Dubbert, F. Rauscher, and A. Staude</subfield>
<subfield code="s">IEEE Trans.Nucl.Sci.,53,562</subfield>
<subfield code="y">2006</subfield>
</datafield>
</record>""")
def test_journal_not_recognized(self):
ref_line = u"""[33] A. Moraes, C. Buttar, and I. Dawson, Prediction for minimum bias and the underlying event at LHC energies, The European Physical Journal C - Particles and Fields 50 (2007) 435–466."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">33</subfield>
<subfield code="h">A. Moraes, C. Buttar, and I. Dawson</subfield>
<subfield code="s">Eur.Phys.J.,C50,435</subfield>
<subfield code="y">2007</subfield>
</datafield>
</record>""")
def test_multiple_eds(self):
ref_line = u"""[7] L. Evans, (ed.) and P. Bryant, (ed.), LHC Machine, JINST 3 (2008) S08001."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">7</subfield>
<subfield code="h">L. Evans, (ed.) and P. Bryant, (ed.)</subfield>
<subfield code="s">JINST,3,S08001</subfield>
<subfield code="y">2008</subfield>
</datafield>
</record>""")
def test_atlas_conf(self):
"""not recognizing preprint format"""
ref_line = u"""[32] The ATLAS Collaboration, Charged particle multiplicities in pp interactions at √s = 0.9 and 7 TeV in a diffractive limited phase space measured with the ATLAS detector at the LHC and a new pythia6 tune, 2010. http://cdsweb.cern.ch/record/1266235/files/ ATLAS-COM-CONF-2010-031.pdf. ATLAS-CONF-2010-031."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">32</subfield>
- <subfield code="h">(The ATLAS Collaboration)</subfield>
+ <subfield code="c">ATLAS Collaboration</subfield>
<subfield code="u">http://cdsweb.cern.ch/record/1266235/files/ATLAS-COM-CONF-2010-031.pdf</subfield>
- <subfield code="r">ATL-CONF-2010-031</subfield>
+ <subfield code="r">ATLAS-CONF-2010-031</subfield>
</datafield>
</record>""")
def test_journal_of_physics(self):
"""eventually not recognizing the journal, the collaboration or authors"""
ref_line = u"""[19] ATLAS Inner Detector software group Collaboration, T. Cornelissen, M. Elsing, I. Gavilenko, W. Liebig, E. Moyse, and A. Salzburger, The new ATLAS Track Reconstruction (NEWT), Journal of Physics 119 (2008) 032014."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">19</subfield>
- <subfield code="h">(ATLAS Inner Detector software group Collaboration) T. Cornelissen, M. Elsing, I. Gavilenko, W. Liebig, E. Moyse, and A. Salzburger</subfield>
+ <subfield code="c">ATLAS Inner Detector software group Collaboration</subfield>
+ <subfield code="h">T. Cornelissen, M. Elsing, I. Gavilenko, W. Liebig, E. Moyse, and A. Salzburger</subfield>
<subfield code="s">J.Phys.,119,032014</subfield>
<subfield code="y">2008</subfield>
</datafield>
</record>""")
def test_jhep(self):
"""was splitting JHEP in JHE: P"""
ref_line = u"""[22] G. P. Salam and G. Soyez, A practical seedless infrared-safe cone jet algorithm, JHEP 05 (2007) 086."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">22</subfield>
<subfield code="h">G. P. Salam and G. Soyez</subfield>
<subfield code="s">JHEP,0705,086</subfield>
<subfield code="y">2007</subfield>
</datafield>
</record>""")
def test_journal_not_recognized2(self):
ref_line = u"""[3] Physics Performance Report Vol 1 – J. Phys. G. Vol 30 N° 11 (2004) 232"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">3</subfield>
<subfield code="s">J.Phys.,G30,232</subfield>
<subfield code="y">2004</subfield>
</datafield>
</record>""")
def test_journal_not_recognized3(self):
ref_line = u"""[3] Physics Performance Report Vol 1 – J. Phys. G. N° 30 (2004) 232"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">3</subfield>
<subfield code="s">J.Phys.,G30,232</subfield>
<subfield code="y">2004</subfield>
</datafield>
</record>""")
def test_journal_not_recognized4(self):
ref_line = u"""[128] D. P. Pritzkau and R. H. Siemann, “Experimental study of rf pulsed heat- ing on oxygen free electronic copper,” Physical Review Special Topics - Accelerators and Beams, vol. 5, pp. 1–22, 2002."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">128</subfield>
<subfield code="h">D. P. Pritzkau and R. H. Siemann</subfield>
<subfield code="t">Experimental study of rf pulsed heat- ing on oxygen free electronic copper</subfield>
<subfield code="s">Phys.Rev.ST Accel.Beams,5,1</subfield>
<subfield code="y">2002</subfield>
</datafield>
</record>""")
def test_journal_not_recognized5(self):
ref_line = u"""[128] D. P. Pritzkau and R. H. Siemann, Phys.Lett. 100B (1981), 117"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">128</subfield>
<subfield code="h">D. P. Pritzkau and R. H. Siemann</subfield>
<subfield code="s">Phys.Lett.,B100,117</subfield>
<subfield code="y">1981</subfield>
</datafield>
</record>""")
def test_note_format1(self):
ref_line = u"""[91] S. Calatroni, H. Neupert, and M. Taborelli, “Fatigue testing of materials by UV pulsed laser irradiation,” CLIC Note 615, CERN, 2004."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">91</subfield>
<subfield code="h">S. Calatroni, H. Neupert, and M. Taborelli</subfield>
<subfield code="t">Fatigue testing of materials by UV pulsed laser irradiation</subfield>
<subfield code="r">CERN-CLIC-Note-615</subfield>
</datafield>
</record>""")
def test_note_format2(self):
ref_line = u"""[5] H. Braun, R. Corsini, J. P. Delahaye, A. de Roeck, S. Dbert, A. Ferrari, G. Geschonke, A. Grudiev, C. Hauviller, B. Jeanneret, E. Jensen, T. Lefvre, Y. Papaphilippou, G. Riddone, L. Rinolfi, W. D. Schlatter, H. Schmickler, D. Schulte, I. Syratchev, M. Taborelli, F. Tecker, R. Toms, S. Weisz, and W. Wuensch, “CLIC 2008 parameters,” tech. rep., CERN CLIC-Note-764, Oct 2008."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">5</subfield>
<subfield code="h">H. Braun, R. Corsini, J. P. Delahaye, A. de Roeck, S. Dbert, A. Ferrari, G. Geschonke, A. Grudiev, C. Hauviller, B. Jeanneret, E. Jensen, T. Lefvre, Y. Papaphilippou, G. Riddone, L. Rinolfi, W. D. Schlatter, H. Schmickler, D. Schulte, I. Syratchev, M. Taborelli, F. Tecker, R. Toms, S. Weisz, and W. Wuensch</subfield>
<subfield code="t">CLIC 2008 parameters</subfield>
<subfield code="r">CERN-CLIC-Note-764</subfield>
</datafield>
</record>""")
def test_remove_empty_misc_tag(self):
ref_line = u"""[21] “http://www.linearcollider.org/.”"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">21</subfield>
<subfield code="u">http://www.linearcollider.org/</subfield>
</datafield>
</record>""", ignore_misc=False)
def test_sub_volume_not_recognized(self):
ref_line = u"""[37] L. Lu, Y. Shen, X. Chen, L. Qian, and K. Lu, “Ultrahigh strength and high electrical conductivity in copper,” Science, vol. 304, no. 5669, pp. 422–426, 2004."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">37</subfield>
<subfield code="h">L. Lu, Y. Shen, X. Chen, L. Qian, and K. Lu</subfield>
<subfield code="t">Ultrahigh strength and high electrical conductivity in copper</subfield>
<subfield code="s">Science,304,422</subfield>
<subfield code="y">2004</subfield>
</datafield>
</record>""")
def test_extra_a_after_journal(self):
ref_line = u"""[28] Particle Data Group Collaboration, K. Nakamura et al., Review of particle physics, J. Phys. G37 (2010) 075021."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">28</subfield>
- <subfield code="h">(Particle Data Group Collaboration) K. Nakamura et al.</subfield>
+ <subfield code="c">Particle Data Group Collaboration</subfield>
+ <subfield code="h">K. Nakamura et al.</subfield>
<subfield code="s">J.Phys.,G37,075021</subfield>
<subfield code="y">2010</subfield>
</datafield>
</record>""")
def test_full_month_with_volume(self):
ref_line = u"""[2] C. Rubbia, Experimental observation of the intermediate vector bosons W+, W−, and Z0, Reviews of Modern Physics 57 (July, 1985) 699–722."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">2</subfield>
<subfield code="h">C. Rubbia</subfield>
<subfield code="s">Rev.Mod.Phys.,57,699</subfield>
<subfield code="y">1985</subfield>
</datafield>
</record>""")
def test_wrong_replacement(self):
"""Wrong replacement
'A. J. Hey, Gauge' was being replaced by 'Astron.J. Hey'
"""
ref_line = u"""[5] I. J. Aitchison and A. J. Hey, Gauge Theories in Particle Physics, Vol II: QCD and the Electroweak Theory. CRC Pr I Llc, 2003."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">5</subfield>
<subfield code="h">I. J. Aitchison and A. J. Hey</subfield>
<subfield code="p">CRC Pr.</subfield>
</datafield>
</record>""")
def test_author_replacement(self):
ref_line = u"""[48] D. Adams, S. Asai, D. Cavalli, M. Du ̈hrssen, K. Edmonds, S. Elles, M. Fehling, U. Felzmann, L. Gladilin, L. Helary, M. Hohlfeld, S. Horvat, K. Jakobs, M. Kaneda, G. Kirsch, S. Kuehn, J. F. Marchand, C. Pizio, X. Portell, D. Rebuzzi, E. Schmidt, A. Shibata, I. Vivarelli, S. Winkelmann, and S. Yamamoto, The ATLFAST-II performance in release 14 -particle signatures and selected benchmark processes-, Tech. Rep. ATL-PHYS-INT-2009-110, CERN, Geneva, Dec, 2009."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">48</subfield>
<subfield code="h">D. Adams, S. Asai, D. Cavalli, M. D\xfchrssen, K. Edmonds, S. Elles, M. Fehling, U. Felzmann, L. Gladilin, L. Helary, M. Hohlfeld, S. Horvat, K. Jakobs, M. Kaneda, G. Kirsch, S. Kuehn, J. F. Marchand, C. Pizio, X. Portell, D. Rebuzzi, E. Schmidt, A. Shibata, I. Vivarelli, S. Winkelmann, and S. Yamamoto</subfield>
<subfield code="r">ATL-PHYS-INT-2009-110</subfield>
</datafield>
</record>""")
def test_author_not_recognized1(self):
ref_line = u"""[7] Pod I., C. Jennings, et al, etc., Nucl. Phys. B342, 15 (1990)"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">7</subfield>
<subfield code="h">Pod I., C. Jennings, et al.</subfield>
<subfield code="s">Nucl.Phys.,B342,15</subfield>
<subfield code="y">1990</subfield>
</datafield>
</record>""")
def test_title_comma(self):
ref_line = u"""[24] R. Downing et al., Nucl. Instrum. Methods, A570, 36 (2007)."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">24</subfield>
<subfield code="h">R. Downing et al.</subfield>
<subfield code="s">Nucl.Instrum.Meth.,A570,36</subfield>
<subfield code="y">2007</subfield>
</datafield>
</record>""")
def test_author1(self):
ref_line = u"""[43] L.S. Durkin and P. Langacker, Phys. Lett B166, 436 (1986); Amaldi et al., Phys. Rev. D36, 1385 (1987); Hayward and Yellow et al., Phys. Lett B245, 669 (1990); Nucl. Phys. B342, 15 (1990);"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">43</subfield>
<subfield code="h">L.S. Durkin and P. Langacker</subfield>
<subfield code="s">Phys.Lett.,B166,436</subfield>
<subfield code="y">1986</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">43</subfield>
<subfield code="h">Amaldi et al.</subfield>
<subfield code="s">Phys.Rev.,D36,1385</subfield>
<subfield code="y">1987</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">43</subfield>
<subfield code="h">Hayward and Yellow et al.</subfield>
<subfield code="s">Phys.Lett.,B245,669</subfield>
<subfield code="y">1990</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">43</subfield>
<subfield code="s">Nucl.Phys.,B342,15</subfield>
<subfield code="y">1990</subfield>
</datafield>
</record>""")
def test_author2(self):
ref_line = u"""[15] Nucl. Phys., B372, 3 (1992); T.G. Rizzo, Phys. Rev. D40, 3035 (1989); Proceedings of the 1990 Summer Study on High Energy Physics. ed E. Berger, June 25-July 13, 1990, Snowmass Colorado (World Scientific, Singapore, 1992) p. 233; V. Barger, J.L. Hewett and T.G. Rizzo, Phys. Rev. D42, 152 (1990); J.L. Hewett, Phys. Lett. B238, 98 (1990)"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">15</subfield>
<subfield code="s">Nucl.Phys.,B372,3</subfield>
<subfield code="y">1992</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">15</subfield>
<subfield code="h">T.G. Rizzo</subfield>
<subfield code="s">Phys.Rev.,D40,3035</subfield>
<subfield code="y">1989</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">15</subfield>
<subfield code="h">(E. Berger (eds.))</subfield>
<subfield code="p">World Scientific</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">15</subfield>
<subfield code="h">V. Barger, J.L. Hewett and T.G. Rizzo</subfield>
<subfield code="s">Phys.Rev.,D42,152</subfield>
<subfield code="y">1990</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">15</subfield>
<subfield code="h">J.L. Hewett</subfield>
<subfield code="s">Phys.Lett.,B238,98</subfield>
<subfield code="y">1990</subfield>
</datafield>
</record>""")
def test_merging(self):
"""Test how references are merged together
We may choose to merge invalid references into the previous one"""
ref_line = u"""[15] Nucl. Phys., B372, 3 (1992); T.G. Rizzo, Phys. Rev. D40, 3035 (1989); Proceedings of the 1990 Summer Study on High Energy Physics; ed E. Berger; V. Barger, J.L. Hewett and T.G. Rizzo ; Phys. Rev. D42, 152 (1990); J.L. Hewett, Phys. Lett. B238, 98 (1990)"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">15</subfield>
<subfield code="s">Nucl.Phys.,B372,3</subfield>
<subfield code="y">1992</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">15</subfield>
<subfield code="h">T.G. Rizzo</subfield>
<subfield code="s">Phys.Rev.,D40,3035</subfield>
<subfield code="y">1989</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">15</subfield>
<subfield code="m">Proceedings of the 1990 Summer Study on High Energy Physics</subfield>
+ </datafield>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">15</subfield>
<subfield code="h">(E. Berger (eds.))</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">15</subfield>
<subfield code="h">V. Barger, J.L. Hewett and T.G. Rizzo</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">15</subfield>
<subfield code="s">Phys.Rev.,D42,152</subfield>
<subfield code="y">1990</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">15</subfield>
<subfield code="h">J.L. Hewett</subfield>
<subfield code="s">Phys.Lett.,B238,98</subfield>
<subfield code="y">1990</subfield>
</datafield>
</record>""", ignore_misc=False)
def test_merging2(self):
ref_line = u"""[15] Nucl. Phys., B372, 3 (1992); hello world"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">15</subfield>
- <subfield code="m">hello world</subfield>
<subfield code="s">Nucl.Phys.,B372,3</subfield>
<subfield code="y">1992</subfield>
</datafield>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">15</subfield>
+ <subfield code="m">hello world</subfield>
+ </datafield>
</record>""", ignore_misc=False)
def test_merging3(self):
ref_line = u"""[15] Nucl. Phys., B372, 3 (1992); hello world T.G. Rizzo foo"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">15</subfield>
<subfield code="s">Nucl.Phys.,B372,3</subfield>
<subfield code="y">1992</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">15</subfield>
+ <subfield code="m">hello world foo</subfield>
<subfield code="h">T.G. Rizzo</subfield>
- <subfield code="m">hello world foo</subfield>
</datafield>
</record>""", ignore_misc=False)
def test_merging4(self):
ref_line = u"""[15] T.G. Rizzo; Nucl. Phys., B372, 3 (1992)"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">15</subfield>
<subfield code="h">T.G. Rizzo</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">15</subfield>
<subfield code="s">Nucl.Phys.,B372,3</subfield>
<subfield code="y">1992</subfield>
</datafield>
</record>""", ignore_misc=False)
+ def test_merging5(self):
+ ref_line = u"""[39] C. Arnaboldi et al., Nucl. Instrum. Meth. A 518 (2004) 775
+[hep-ex/0212053]; M. Sisti [CUORE Collaboration], J. Phys. Conf. Ser. 203 (2010)
+012069; F. Bellini, C. Bucci, S. Capelli, O. Cremonesi, L. Gironi, M. Martinez, M. Pavan
+and C. Tomei et al., Astropart. Phys. 33 (2010) 169 [arXiv:0912.0452 [physics.ins-det]]."""
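+ # Note: the hard line breaks in this input join "M. Pavan" and "and"
+ # into "Pavanand" in the expected author list, and journals missing
+ # from the knowledge base stay in misc subfields (hence ignore_misc=False).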
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">39</subfield>
+ <subfield code="h">C. Arnaboldi et al.</subfield>
+ <subfield code="s">Nucl.Instrum.Meth.,A518,775</subfield>
+ <subfield code="r">hep-ex/0212053</subfield>
+ <subfield code="y">2004</subfield>
+ </datafield>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">39</subfield>
+ <subfield code="h">M. Sisti</subfield>
+ <subfield code="c">CUORE Collaboration</subfield>
+ <subfield code="m">J. Phys. Conf. Ser. 203 (2010)012069</subfield>
+ </datafield>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">39</subfield>
+ <subfield code="h">F. Bellini, C. Bucci, S. Capelli, O. Cremonesi, L. Gironi, M. Martinez, M. Pavanand C. Tomei et al.</subfield>
+ <subfield code="m">Astropart. Phys. 33 (2010) 169</subfield>
+ <subfield code="r">arXiv:0912.0452 [physics.ins-det]</subfield>
+ </datafield>
+</record>""", ignore_misc=False)
+
def test_extra_blank_reference(self):
ref_line = u"""[26] U. Gursoy and E. Kiritsis, “Exploring improved holographic theories for QCD: Part I,” JHEP 0802 (2008) 032 [ArXiv:0707.1324][hep-th]; U. Gursoy, E. Kiritsis and F. Nitti, “Exploring improved holographic theories for QCD: Part II,” JHEP 0802 (2008) 019 [ArXiv:0707.1349][hep-th];"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">26</subfield>
<subfield code="h">U. Gursoy and E. Kiritsis</subfield>
- <subfield code="t">Exploring improved holographic theories for QCD Part I</subfield>
+ <subfield code="t">Exploring improved holographic theories for QCD: Part I</subfield>
<subfield code="s">JHEP,0802,032</subfield>
<subfield code="r">arXiv:0707.1324</subfield>
+ <subfield code="m">[hep-th]</subfield>
<subfield code="y">2008</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">26</subfield>
<subfield code="h">U. Gursoy, E. Kiritsis and F. Nitti</subfield>
- <subfield code="t">Exploring improved holographic theories for QCD Part II</subfield>
+ <subfield code="t">Exploring improved holographic theories for QCD: Part II</subfield>
<subfield code="s">JHEP,0802,019</subfield>
<subfield code="r">arXiv:0707.1349</subfield>
+ <subfield code="m">[hep-th]</subfield>
<subfield code="y">2008</subfield>
</datafield>
-</record>""")
+</record>""", ignore_misc=False)
def test_invalid_author(self):
"""used to detected invalid author as at Finite T"""
ref_line = u"""[23] A. Taliotis, “qq ̄ Potential at Finite T and Weak Coupling in N = 4,” Phys. Rev. C83, 045204 (2011). [ArXiv:1011.6618][hep-th]."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">23</subfield>
<subfield code="h">A. Taliotis</subfield>
<subfield code="t">qq \u0304 Potential at Finite T and Weak Coupling in N = 4</subfield>
<subfield code="s">Phys.Rev.,C83,045204</subfield>
<subfield code="r">arXiv:1011.6618</subfield>
<subfield code="y">2011</subfield>
</datafield>
</record>""")
def test_split_arxiv(self):
"""used to split arxiv reference from its reference"""
ref_line = u"""[18] A. Taliotis, “DIS from the AdS/CFT correspondence,” Nucl. Phys. A830, 299C-302C (2009). [ArXiv:0907.4204][hep-th]."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">18</subfield>
<subfield code="h">A. Taliotis</subfield>
<subfield code="t">DIS from the AdS/CFT correspondence</subfield>
<subfield code="s">Nucl.Phys.,A830,299C</subfield>
<subfield code="r">arXiv:0907.4204</subfield>
<subfield code="y">2009</subfield>
</datafield>
</record>""")
def test_report_without_dash(self):
ref_line = u"""[20] G. Duckeck et al., “ATLAS computing: Technical design report,” CERN-LHCC2005-022."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">20</subfield>
<subfield code="h">G. Duckeck et al.</subfield>
- <subfield code="t">ATLAS computing Technical design report</subfield>
+ <subfield code="t">ATLAS computing: Technical design report</subfield>
<subfield code="r">CERN-LHCC-2005-022</subfield>
</datafield>
</record>""")
def test_report_with_slashes(self):
ref_line = u"""[20] G. Duckeck et al., “ATLAS computing: Technical design report,” CERN/LHCC/2005-022."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">20</subfield>
<subfield code="h">G. Duckeck et al.</subfield>
- <subfield code="t">ATLAS computing Technical design report</subfield>
+ <subfield code="t">ATLAS computing: Technical design report</subfield>
<subfield code="r">CERN-LHCC-2005-022</subfield>
</datafield>
</record>""")
def test_ed_before_et_al(self):
ref_line = u"""[20] G. Duckeck, (ed. ) et al., “ATLAS computing: Technical design report,” CERN-LHCC-2005-022."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">20</subfield>
<subfield code="h">G. Duckeck, (ed.) et al.</subfield>
- <subfield code="t">ATLAS computing Technical design report</subfield>
+ <subfield code="t">ATLAS computing: Technical design report</subfield>
<subfield code="r">CERN-LHCC-2005-022</subfield>
</datafield>
</record>""")
def test_journal_but_no_page(self):
ref_line = u"""[20] G. Duckeck, “ATLAS computing: Technical design report,” JHEP,03,1988"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">20</subfield>
<subfield code="h">G. Duckeck</subfield>
- <subfield code="t">ATLAS computing Technical design report</subfield>
+ <subfield code="t">ATLAS computing: Technical design report</subfield>
</datafield>
</record>""")
def test_isbn1(self):
ref_line = u"""[22] B. Crowell, Vibrations and Waves. www.lightandmatter.com, 2009. ISBN 0-9704670-3-6."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">22</subfield>
<subfield code="h">B. Crowell</subfield>
<subfield code="i">0-9704670-3-6</subfield>
</datafield>
</record>""")
def test_isbn2(self):
ref_line = u"""[119] D. E. Gray, American Institute of Physics Handbook. Mcgraw-Hill, 3rd ed., 1972. ISBN 9780070014855."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">119</subfield>
<subfield code="h">D. E. Gray</subfield>
<subfield code="p">McGraw-Hill</subfield>
<subfield code="i">9780070014855</subfield>
</datafield>
</record>""")
def test_book(self):
ref_line = u"""[1] D. Griffiths, “Introduction to elementary particles,” Weinheim, USA: Wiley-VCH (2008) 454 p."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">1</subfield>
<subfield code="h">D. Griffiths</subfield>
<subfield code="p">Wiley-VCH</subfield>
<subfield code="t">Introduction to elementary particles</subfield>
- <subfield code="xbook" />
<subfield code="y">2008</subfield>
</datafield>
</record>""")
def test_complex_arxiv(self):
ref_line = u"""[4] J.Prat, arXiv:1012.3675v1 [physics.ins-det]"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">4</subfield>
<subfield code="h">J.Prat</subfield>
<subfield code="r">arXiv:1012.3675 [physics.ins-det]</subfield>
</datafield>
</record>""")
def test_new_arxiv(self):
ref_line = u"""[178] D. R. Tovey, On measuring the masses of pair-produced semi-invisibly decaying particles at hadron colliders, JHEP 04 (2008) 034, [0802.2879]."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">178</subfield>
<subfield code="h">D. R. Tovey</subfield>
<subfield code="s">JHEP,0804,034</subfield>
<subfield code="r">arXiv:0802.2879</subfield>
<subfield code="y">2008</subfield>
</datafield>
</record>""")
def test_new_arxiv2(self):
ref_line = u"""[178] D. R. Tovey, On measuring the masses of pair-produced semi-invisibly decaying particles at hadron colliders, JHEP 04 (2008) 034, [9112.2879]."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">178</subfield>
<subfield code="h">D. R. Tovey</subfield>
<subfield code="s">JHEP,0804,034</subfield>
<subfield code="r">arXiv:9112.2879</subfield>
<subfield code="y">2008</subfield>
</datafield>
</record>""")
def test_new_arxiv3(self):
ref_line = u"""[178] D. R. Tovey, On measuring the masses of pair-produced semi-invisibly decaying particles at hadron colliders, JHEP 04 (2008) 034, [1212.2879]."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">178</subfield>
<subfield code="h">D. R. Tovey</subfield>
<subfield code="s">JHEP,0804,034</subfield>
<subfield code="r">arXiv:1212.2879</subfield>
<subfield code="y">2008</subfield>
</datafield>
</record>""")
def test_new_arxiv_invalid(self):
ref_line = u"""[178] D. R. Tovey, On measuring the masses of pair-produced semi-invisibly decaying particles at hadron colliders, JHEP 04 (2008) 034, [9002.2879]."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">178</subfield>
<subfield code="h">D. R. Tovey</subfield>
<subfield code="s">JHEP,0804,034</subfield>
<subfield code="y">2008</subfield>
</datafield>
</record>""")
def test_new_arxiv_invalid2(self):
ref_line = u"""[178] D. R. Tovey, On measuring the masses of pair-produced semi-invisibly decaying particles at hadron colliders, JHEP 04 (2008) 034, [9113.2879]."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">178</subfield>
<subfield code="h">D. R. Tovey</subfield>
<subfield code="s">JHEP,0804,034</subfield>
<subfield code="y">2008</subfield>
</datafield>
</record>""")
def test_special_journals(self):
ref_line = u"""[178] D. R. Tovey, JHEP 04 (2008) 034"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">178</subfield>
<subfield code="h">D. R. Tovey</subfield>
<subfield code="s">JHEP,0804,034</subfield>
<subfield code="y">2008</subfield>
</datafield>
</record>""")
def test_unrecognized_author(self):
ref_line = u"""[27] B. Feng, Y. -H. He, P. Fre', "On correspondences between toric singularities and (p,q) webs," Nucl. Phys. B701 (2004) 334-356. [hep-th/0403133]"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">27</subfield>
<subfield code="h">B. Feng, Y. -H. He, P. Fre'</subfield>
<subfield code="t">On correspondences between toric singularities and (p,q) webs</subfield>
<subfield code="s">Nucl.Phys.,B701,334</subfield>
<subfield code="r">hep-th/0403133</subfield>
<subfield code="y">2004</subfield>
</datafield>
</record>""")
def test_unrecognized_author2(self):
ref_line = u"""[75] J. M. Figueroa-O’Farrill, J. M. Figueroa-O'Farrill, C. M. Hull and B. J. Spence, "Branes at conical singularities and holography," Adv. Theor. Math. Phys. 2, 1249 (1999) [arXiv:hep-th/9808014]"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">75</subfield>
<subfield code="h">J. M. Figueroa-O’Farrill, J. M. Figueroa-O'Farrill, C. M. Hull and B. J. Spence</subfield>
<subfield code="t">Branes at conical singularities and holography</subfield>
<subfield code="s">Adv.Theor.Math.Phys.,2,1249</subfield>
<subfield code="r">hep-th/9808014</subfield>
<subfield code="y">1999</subfield>
</datafield>
</record>""")
def test_pos(self):
ref_line = u"""[23] M. A. Donnellan, et al., PoS LAT2007 (2007) 369."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">23</subfield>
<subfield code="h">M. A. Donnellan, et al.</subfield>
<subfield code="s">PoS,LAT2007,369</subfield>
<subfield code="y">2007</subfield>
</datafield>
</record>""")
def test_pos2(self):
ref_line = u"""[23] M. A. Donnellan, et al., PoS LAT2007 2007 369."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">23</subfield>
<subfield code="h">M. A. Donnellan, et al.</subfield>
<subfield code="s">PoS,LAT2007,369</subfield>
<subfield code="y">2007</subfield>
</datafield>
</record>""")
def test_pos3(self):
ref_line = u"""[23] M. A. Donnellan, et al., PoS(LAT2005)239."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">23</subfield>
<subfield code="h">M. A. Donnellan, et al.</subfield>
<subfield code="s">PoS,LAT2005,239</subfield>
<subfield code="y">2005</subfield>
</datafield>
</record>""")
+ def test_pos4(self):
+ ref_line = u"""[23] PoS CHARGED 2010, 030 (2010)"""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">23</subfield>
+ <subfield code="s">PoS,CHARGED2010,030</subfield>
+ <subfield code="y">2010</subfield>
+ </datafield>
+</record>""")
+
def test_complex_author(self):
ref_line = u"""[39] Michael E. Peskin, Michael E. Peskin and Michael E. Peskin “An Introduction To Quantum Field Theory,” Westview Press, 1995."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">39</subfield>
<subfield code="h">Michael E. Peskin, Michael E. Peskin and Michael E. Peskin</subfield>
<subfield code="t">An Introduction To Quantum Field Theory</subfield>
</datafield>
</record>""")
def test_complex_author2(self):
ref_line = u"""[39] Dan V. Schroeder, Dan V. Schroeder and Dan V. Schroeder “An Introduction To Quantum Field Theory,” Westview Press, 1995."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">39</subfield>
<subfield code="h">Dan V. Schroeder, Dan V. Schroeder and Dan V. Schroeder</subfield>
<subfield code="t">An Introduction To Quantum Field Theory</subfield>
</datafield>
</record>""")
def test_dan_journal(self):
ref_line = u"""[39] Michael E. Peskin and Dan V. Schroeder “An Introduction To Quantum Field Theory,” Westview Press, 1995."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">39</subfield>
<subfield code="h">Michael E. Peskin and Dan V. Schroeder</subfield>
<subfield code="t">An Introduction To Quantum Field Theory</subfield>
</datafield>
</record>""")
def test_dan_journal2(self):
ref_line = u"""[39] Dan V. Schroeder DAN B701 (2004) 334-356"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">39</subfield>
<subfield code="h">Dan V. Schroeder</subfield>
<subfield code="s">Dokl.Akad.Nauk Ser.Fiz.,B701,334</subfield>
<subfield code="y">2004</subfield>
</datafield>
</record>""")
def test_query_in_url(self):
ref_line = u"""[69] ATLAS Collaboration. Mutag. http://indico.cern.ch/getFile.py/access?contribId=9&resId=1&materialId=slides&confId=35502"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">69</subfield>
- <subfield code="h">(ATLAS Collaboration)</subfield>
+ <subfield code="c">ATLAS Collaboration</subfield>
<subfield code="u">http://indico.cern.ch/getFile.py/access?contribId=9&amp;resId=1&amp;materialId=slides&amp;confId=35502</subfield>
</datafield>
</record>""")
def test_volume_colon_page(self):
ref_line = u"""[77] J. M. Butterworth et al. Multiparton interactions in photoproduction at hera. Z.Phys.C72:637-646,1996."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">77</subfield>
<subfield code="h">J. M. Butterworth et al.</subfield>
<subfield code="s">Z.Phys.,C72,637</subfield>
<subfield code="y">1996</subfield>
</datafield>
</record>""")
def test_no_spaces_numeration(self):
ref_line = u"""[1] I.M. Gregor et al, Optical links for the ATLAS SCT and Pixel detector, Z.Phys. 465(2001)131-134"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">1</subfield>
<subfield code="h">I.M. Gregor et al.</subfield>
<subfield code="s">Z.Phys.,465,131</subfield>
<subfield code="y">2001</subfield>
</datafield>
</record>""")
def test_dot_after_year(self):
ref_line = u"""[1] Neutrino Mass and New Physics, Phys.Rev. 2006. 56:569-628"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">1</subfield>
<subfield code="s">Phys.Rev.,56,569</subfield>
<subfield code="y">2006</subfield>
</datafield>
</record>""")
def test_journal_roman(self):
ref_line = u"""[19] D. Page and C. Pope, Commun. Math. Phys. VI (1990) 529."""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">19</subfield>
<subfield code="h">D. Page and C. Pope</subfield>
<subfield code="s">Commun.Math.Phys.,6,529</subfield>
<subfield code="y">1990</subfield>
</datafield>
</record>""")
def test_journal_phys_rev_d(self):
ref_line = u"""[6] Sivers D. W., Phys. Rev.D, 41 (1990) 83"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">6</subfield>
<subfield code="h">Sivers D. W.</subfield>
<subfield code="s">Phys.Rev.,D41,83</subfield>
<subfield code="y">1990</subfield>
</datafield>
</record>""")
def test_publisher(self):
ref_line = u"""[6] Sivers D. W., BrAnS Hello"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">6</subfield>
<subfield code="h">Sivers D. W.</subfield>
<subfield code="p">Brans</subfield>
</datafield>
</record>""")
def test_hep_formatting(self):
ref_line = u"""[6] Sivers D. W., hep-ph-9711200"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">6</subfield>
<subfield code="h">Sivers D. W.</subfield>
<subfield code="r">hep-ph/9711200</subfield>
</datafield>
</record>""")
def test_hep_formatting2(self):
ref_line = u"""[6] Sivers D. W., astro-ph-9711200"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">6</subfield>
<subfield code="h">Sivers D. W.</subfield>
<subfield code="r">astro-ph/9711200</subfield>
</datafield>
</record>""")
def test_nucl_phys_b_removal(self):
ref_line = u"""[6] Sivers D. W., Nucl. Phys. (Proc.Suppl.) B21 (2004) 334-356"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">6</subfield>
<subfield code="h">Sivers D. W.</subfield>
<subfield code="s">Nucl.Phys.Proc.Suppl.,21,334</subfield>
<subfield code="y">2004</subfield>
</datafield>
</record>""")
def test_citations_splitting(self):
ref_line = u"""[6] Sivers D. W., CERN-EX-0106015, D. Page, CERN-EX-0104007"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">6</subfield>
<subfield code="h">Sivers D. W.</subfield>
<subfield code="r">CERN-EX-0106015</subfield>
+ <subfield code="0">1</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">6</subfield>
<subfield code="h">D. Page</subfield>
<subfield code="r">CERN-EX-0104007</subfield>
+ <subfield code="0">2</subfield>
</datafield>
</record>""")
def test_citations_splitting2(self):
ref_line = u"""[6] Sivers D. W., hep-ex/0201013, D. Page, CERN-EP-2001-094"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">6</subfield>
<subfield code="h">Sivers D. W.</subfield>
<subfield code="r">hep-ex/0201013</subfield>
<subfield code="r">CERN-EP-2001-094</subfield>
+ <subfield code="0">10</subfield>
</datafield>
</record>""")
def test_arxiv_report_number(self):
"""Should be recognized by arxiv regexps list
(not in report-numbers.kb)
"""
- ref_line = u"""[6] Sivers D. W., math.AA/8888888"""
+ ref_line = u"""[6] Sivers D. W., math.AA/0101888"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">6</subfield>
<subfield code="h">Sivers D. W.</subfield>
- <subfield code="r">math.AA/8888888</subfield>
+ <subfield code="r">math.AA/0101888</subfield>
+ </datafield>
+</record>""")
+
+ def test_arxiv_report_number2(self):
+ """: instead of / in arxiv report number"""
+ ref_line = u"""[12] C. T. Hill and E. H. Simmons, Phys. Rept. 381: 235-402 (2003), Erratum-ibid. 390: 553-554 (2004) [arXiv: hep-ph:0203079]."""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">12</subfield>
+ <subfield code="h">C. T. Hill and E. H. Simmons</subfield>
+ <subfield code="r">hep-ph/0203079</subfield>
+ </datafield>
+</record>""")
+
+ def test_arxiv_report_number3(self):
+ """: instead of / in arxiv report number"""
+ ref_line = u"""[12] hep-ph/0203079v1"""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">12</subfield>
+ <subfield code="r">hep-ph/0203079</subfield>
+ </datafield>
+</record>""")
+
+ def test_arxiv_report_number4(self):
+ """: instead of / in arxiv report number"""
+ ref_line = u"""[12] hep-ph/0203079invalid"""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">12</subfield>
+ <subfield code="m">hep-ph/0203079invalid</subfield>
+ </datafield>
+</record>""", ignore_misc=False)
+
+ def test_arxiv_not_parsed(self):
+ ref_line = u"""[12] arXiv: 0701034 [hep-ph]"""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">12</subfield>
+ <subfield code="r">hep-ph/0701034</subfield>
</datafield>
</record>""")
def test_arxiv_report_number_replacement(self):
"""Should be replaced by a valid arxiv report number"""
- ref_line = u"""[6] Sivers D. W., astro-phy/8888888"""
+ ref_line = u"""[6] Sivers D. W., astro-phy/0101888"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">6</subfield>
<subfield code="h">Sivers D. W.</subfield>
- <subfield code="r">astro-ph/8888888</subfield>
+ <subfield code="r">astro-ph/0101888</subfield>
</datafield>
</record>""")
def test_only_report_number(self):
ref_line = u"""[6] ATL-PHYS-INT-2009-110"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">6</subfield>
<subfield code="r">ATL-PHYS-INT-2009-110</subfield>
</datafield>
</record>""")
def test_only_journal(self):
ref_line = u"""[6] Phys. Rev.D, 41 (1990) 83"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">6</subfield>
<subfield code="s">Phys.Rev.,D41,83</subfield>
<subfield code="y">1990</subfield>
</datafield>
</record>""")
def test_only_doi(self):
ref_line = u"""[6] doi:10.1007/s10440-008-9280-9"""
_reference_test(self, ref_line, u"""<record>
- <controlfield tag="001">1</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">6</subfield>
<subfield code="a">10.1007/s10440-008-9280-9</subfield>
</datafield>
</record>""")
def test_reference_size_limit_check_valid_in_one_line(self):
- from invenio.refextract_api import extract_references_from_string_xml
+ from invenio.refextract_api import extract_references_from_string
ref_line = u"""[1] D. Adams, S. Asai, D. Cavalli, K. Edmonds,
The ATLFAST-II performance in release 14,
Tech. Rep. ATL-PHYS-INT-2009-110, CERN, Geneva, Dec, 2009.
[2] D. Adams, ATL-PHYS-INT-2009-111"""
- refs = extract_references_from_string_xml(ref_line)
- compare_references(self, refs, u"""<record>
- <controlfield tag="001">1</controlfield>
+ record = extract_references_from_string(ref_line)
+ compare_references(self, record, u"""<record>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">1</subfield>
<subfield code="h">D. Adams, S. Asai, D. Cavalli, K. Edmonds</subfield>
<subfield code="r">ATL-PHYS-INT-2009-110</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">2</subfield>
<subfield code="h">D. Adams</subfield>
<subfield code="r">ATL-PHYS-INT-2009-111</subfield>
</datafield>
</record>""")
def test_reference_size_limit_but_removed_as_invalid(self):
"""Test the removal of references that are more than n lines long
Needs to match test_reference_size_limit_check_valid_in_one_line
above but be on multiple lines
"""
- from invenio.refextract_api import extract_references_from_string_xml
+ from invenio.refextract_api import extract_references_from_string
ref_line = u"""[1] D. Adams, S. Asai, D. Cavalli, K. Edmonds,
a\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\n
a\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\n
The ATLFAST-II performance in release 14,
Tech. Rep. ATL-PHYS-INT-2009-110, CERN, Geneva, Dec, 2009.
[2] D. Adams, ATL-PHYS-INT-2009-111"""
- refs = extract_references_from_string_xml(ref_line)
- compare_references(self, refs, u"""<record>
- <controlfield tag="001">1</controlfield>
+ record = extract_references_from_string(ref_line)
+ compare_references(self, record, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">1</subfield>
+ <subfield code="h">D. Adams, S. Asai, D. Cavalli, K. Edmonds</subfield>
+ </datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">2</subfield>
<subfield code="h">D. Adams</subfield>
<subfield code="r">ATL-PHYS-INT-2009-111</subfield>
</datafield>
</record>""")
+ def test_author_tag_inside_quoted(self):
+ """Tests embeded tags in quoted text
+
+ We want to avoid this
+ <cds.QUOTED>Electroweak parameters of the Z0 resonance and the Standard
+ Model <cds.AUTHincl>the LEP Collaborations</cds.AUTHincl></cds.QUOTED>
+ """
+ ref_line = u"""[10] LEP Collaboration, G. Alexander et al., “Electroweak parameters of the Z0 resonance and the Standard Model: the LEP Collaborations,” Phys. Lett. B276 (1992) 247–253."""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">10</subfield>
+ <subfield code="c">LEP Collaboration</subfield>
+ <subfield code="h">G. Alexander et al.</subfield>
+ <subfield code="t">Electroweak parameters of the Z0 resonance and the Standard Model: the LEP Collaborations</subfield>
+ <subfield code="s">Phys.Lett.,B276,247</subfield>
+ <subfield code="y">1992</subfield>
+ </datafield>
+</record>""")
+
+ def test_misparsing_arxiv(self):
+ ref_line = u"""[21] R. Barlow, Asymmetric errors, eConf C030908 (2003), arXiv:physics/0401042."""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">21</subfield>
+ <subfield code="h">R. Barlow</subfield>
+ <subfield code="r">physics/0401042</subfield>
+ </datafield>
+</record>""")
+
+ def test_no_volume(self):
+ ref_line = u"""[6] Owen F.N., Rudnick L., 1976, Phys. Rev., 205L, 1"""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">6</subfield>
+ <subfield code="h">Owen F.N., Rudnick L.</subfield>
+ <subfield code="s">Phys.Rev.,L205,1</subfield>
+ <subfield code="y">1976</subfield>
+ </datafield>
+</record>""")
+
+ def test_numeration_detached(self):
+ """Numeration detection check
+
+        At some point this was reporting two journals, detecting the same
+        numeration twice
+ """
+ ref_line = u"""[6] B. Friman, in The CBM Phys. Rev. book: Compressed baryonic matter in laboratory, Phys. Rev. 814, 1 (2011)."""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">6</subfield>
+ <subfield code="h">B. Friman</subfield>
+ <subfield code="s">Phys.Rev.,814,1</subfield>
+ <subfield code="y">2011</subfield>
+ </datafield>
+</record>""")
+
+ def test_no_volume2(self):
+ """At some point failed to report volume correctly"""
+ ref_line = u"""[3] S. Sarkar, Nucl. Phys. A 862-863, 13 (2011)."""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">3</subfield>
+ <subfield code="h">S. Sarkar</subfield>
+ <subfield code="s">Nucl.Phys.,A862,13</subfield>
+ <subfield code="y">2011</subfield>
+ </datafield>
+</record>""")
+
+ def test_journal_title_mangled(self):
+ """Makes sure this journal gets confused with an author"""
+ ref_line = u"""[12] K. G. Chetyrkin and A. Khodjamirian, Eur. Phys. J. C46 (2006)
+721"""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">12</subfield>
+ <subfield code="h">K. G. Chetyrkin and A. Khodjamirian</subfield>
+ <subfield code="s">Eur.Phys.J.,C46,721</subfield>
+ <subfield code="y">2006</subfield>
+ </datafield>
+</record>""")
+
+ def test_volume_letter_goes_missing(self):
+ ref_line = u"""[6] N. Cabibbo and G. Parisi, Phys. Lett. 59 B (1975) 67."""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">6</subfield>
+ <subfield code="h">N. Cabibbo and G. Parisi</subfield>
+ <subfield code="s">Phys.Lett.,B59,67</subfield>
+ <subfield code="y">1975</subfield>
+ </datafield>
+</record>""")
+
+ def test_removed_dot_in_authors(self):
+ ref_line = u"""[6] Cabibbo N. and Parisi G.: Phys. Lett. 59 B (1975) 67."""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">6</subfield>
+ <subfield code="h">Cabibbo N. and Parisi G.</subfield>
+ <subfield code="s">Phys.Lett.,B59,67</subfield>
+ <subfield code="y">1975</subfield>
+ </datafield>
+</record>""")
+
+ def test_author_with_accents(self):
+ ref_line = u"""[1] Ôrlo A., Eur. Phys. J. C46 (2006) 721"""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">1</subfield>
+ <subfield code="h">Ôrlo A.</subfield>
+ <subfield code="s">Eur.Phys.J.,C46,721</subfield>
+ <subfield code="y">2006</subfield>
+ </datafield>
+</record>""")
+
+ def test_implied_ibid(self):
+ ref_line = u"""[4] S. F. King and G. G. Ross, Phys. Lett. B 520, 243 (2001); 574, 239 (2003)"""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">4</subfield>
+ <subfield code="h">S. F. King and G. G. Ross</subfield>
+ <subfield code="s">Phys.Lett.,B520,243</subfield>
+ <subfield code="y">2001</subfield>
+ </datafield>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">4</subfield>
+ <subfield code="s">Phys.Lett.,B574,239</subfield>
+ <subfield code="y">2003</subfield>
+ </datafield>
+</record>""")
+
+ def test_implied_ibid2(self):
+ ref_line = u"""[4] S. F. King and G. G. Ross, Phys. Lett. B 520, 243 (2001); C574, 239 (2003)"""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">4</subfield>
+ <subfield code="h">S. F. King and G. G. Ross</subfield>
+ <subfield code="s">Phys.Lett.,B520,243</subfield>
+ <subfield code="y">2001</subfield>
+ </datafield>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">4</subfield>
+ <subfield code="s">Phys.Lett.,C574,239</subfield>
+ <subfield code="y">2003</subfield>
+ </datafield>
+</record>""")
+
+ def test_implied_ibid3(self):
+ ref_line = u"""[4] S. F. King and G. G. Ross, Phys. Lett. B 520, 243 (2001); 574, 239 (2003); 575, 240 (2004); 576, 241 (2005)"""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">4</subfield>
+ <subfield code="h">S. F. King and G. G. Ross</subfield>
+ <subfield code="s">Phys.Lett.,B520,243</subfield>
+ <subfield code="y">2001</subfield>
+ </datafield>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">4</subfield>
+ <subfield code="s">Phys.Lett.,B574,239</subfield>
+ <subfield code="y">2003</subfield>
+ </datafield>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">4</subfield>
+ <subfield code="s">Phys.Lett.,B575,240</subfield>
+ <subfield code="y">2004</subfield>
+ </datafield>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">4</subfield>
+ <subfield code="s">Phys.Lett.,B576,241</subfield>
+ <subfield code="y">2005</subfield>
+ </datafield>
+</record>""")
+
+ def test_implied_ibid4(self):
+ ref_line = u"""[10] R. Foot, H.N. Long and T.A. Tran, Phys. Rev. D50, R34 (1994); H.N. Long, ibid. 53, 437 (1996); 54, 4691 (1996)."""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">10</subfield>
+ <subfield code="h">R. Foot, H.N. Long and T.A. Tran</subfield>
+ <subfield code="s">Phys.Rev.,D50,R34</subfield>
+ <subfield code="y">1994</subfield>
+ </datafield>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">10</subfield>
+ <subfield code="h">H.N. Long</subfield>
+ <subfield code="s">Phys.Rev.,D53,437</subfield>
+ <subfield code="y">1996</subfield>
+ </datafield>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">10</subfield>
+ <subfield code="s">Phys.Rev.,D54,4691</subfield>
+ <subfield code="y">1996</subfield>
+ </datafield>
+</record>""")
+
+ def test_report_number(self):
+ ref_line = u"""[10] [physics.plasm-ph/0409093]."""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">10</subfield>
+ <subfield code="r">physics.plasm-ph/0409093</subfield>
+ </datafield>
+</record>""")
+
+ def test_journal2(self):
+ ref_line = u"""[1] Phys.Rev. A, : 78 (2008) 012115"""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">1</subfield>
+ <subfield code="s">Phys.Rev.,A78,012115</subfield>
+ <subfield code="y">2008</subfield>
+ </datafield>
+</record>""")
+
+ def test_authors_merge(self):
+ ref_line = u"""[44] R. Baier et al., Invalid. Hello. Lett. B 345 (1995)."""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">44</subfield>
+ <subfield code="h">R. Baier et al.</subfield>
+ <subfield code="m">Invalid. Hello. Lett. B 345 (1995)</subfield>
+ </datafield>
+</record>""", ignore_misc=False)
+
+ def test_atlas_conf_99(self):
+ ref_line = u'[14] ATLAS-CONF-99-078'
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">14</subfield>
+ <subfield code="r">ATL-CONF-99-078</subfield>
+ </datafield>
+</record>""")
+
+ def test_atlas_conf_pre_2010(self):
+ ref_line = u'[14] ATL-CONF-2003-078'
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">14</subfield>
+ <subfield code="r">ATL-CONF-2003-078</subfield>
+ </datafield>
+</record>""")
+
+ def test_atlas_conf_pre_2010_2(self):
+ ref_line = u'[14] ATLAS-CONF-2003-078'
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">14</subfield>
+ <subfield code="r">ATL-CONF-2003-078</subfield>
+ </datafield>
+</record>""")
+
+ def test_atlas_conf_post_2010(self):
+ ref_line = u'[14] ATLAS-CONF-2012-078'
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">14</subfield>
+ <subfield code="r">ATLAS-CONF-2012-078</subfield>
+ </datafield>
+</record>""")
+
+ def test_atlas_conf_post_2010_2(self):
+ ref_line = u'[14] ATL-CONF-2012-078'
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">14</subfield>
+ <subfield code="r">ATLAS-CONF-2012-078</subfield>
+ </datafield>
+</record>""")
+
+ def test_atlas_conf_post_2010_invalid(self):
+ ref_line = u'[14] ATL-CONF-2012-0784'
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">14</subfield>
+ </datafield>
+</record>""")
+
+ def test_journal_missed(self):
+ ref_line = u"[1] M. G. Mayer, Phys. Rev. 75 (1949), 1969; O. Hazel, J. H. D. Jensen, and H. E. Suess, Phys. Rev. 75 (1949), 1766."
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">1</subfield>
+ <subfield code="h">M. G. Mayer</subfield>
+ <subfield code="s">Phys.Rev.,75,1969</subfield>
+ <subfield code="y">1949</subfield>
+ </datafield>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">1</subfield>
+ <subfield code="h">O. Hazel, J. H. D. Jensen, and H. E. Suess</subfield>
+ <subfield code="s">Phys.Rev.,75,1766</subfield>
+ <subfield code="y">1949</subfield>
+ </datafield>
+</record>""")
+
+ def test_invalid_publisher(self):
+ """test_invalid_publisher
+
+        This must not treat the "lbl" inside Hoelbling as a publisher"""
+ ref_line = u"[35] G. I. Egri, Z. Fodor, C. Hoelbling, S. D. Katz, D. Nógrádi, et. al., Lattice QCD as a video game, Comput.Phys.Commun. 177 (2007) 631–639, [hep-lat/0611022]."
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">35</subfield>
+ <subfield code="h">G. I. Egri, Z. Fodor, C. Hoelbling, S. D. Katz, D. N\xf3gr\xe1di, et al.</subfield>
+ <subfield code="r">hep-lat/0611022</subfield>
+ </datafield>
+</record>""")
+
+ def test_valid_publisher(self):
+ """test_invalid_publisher
+
+ This needs to not consider the lbl in Hoelbling as a publisher"""
+ ref_line = u"[35] [LBL]"
+ _reference_test(self, ref_line, u"""<record>
+ <datafield tag="999" ind1="C" ind2="5">
+ <subfield code="o">35</subfield>
+ <subfield code="p">LBL</subfield>
+ </datafield>
+</record>""")
+
+ def test_missed_collaboration(self):
+ ref_line = u"""[76] these results replace the Λb → J/ψΛ and B0 → J/ψKS lifetime measurements of A. Abulencia et al. (CDF collaboration), Phys. Rev. Lett. 98, 122001 (2007), arXiv:hep-ex/0609021, as well as the B0 → J/ψK∗0"""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield ind1="C" ind2="5" tag="999">
+ <subfield code="o">76</subfield>
+ <subfield code="h">Abulencia et al.</subfield>
+ <subfield code="c">CDF collaboration</subfield>
+ <subfield code="s">Phys.Rev.Lett.,98,122001</subfield>
+ <subfield code="r">hep-ex/0609021</subfield>
+ <subfield code="y">2007</subfield>
+ </datafield>
+</record>""")
+
+ def test_remove_duplicate_doi(self):
+ ref_line = u"""[1] doi:10.1007/s10440-008-9280-9 doi:10.1007/s10440-008-9280-9"""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield ind1="C" ind2="5" tag="999">
+ <subfield code="o">1</subfield>
+ <subfield code="a">10.1007/s10440-008-9280-9</subfield>
+ </datafield>
+</record>""")
+
+ def test_leftover_tag(self):
+ ref_line = u"""[2] ΦΦΦΦΦΦΦΦΦΦΦΦΦΦΦΦΦΦ E. Dudas, G. von Gersdorff, J. Parmentier and S. Pokorski, arXiv:1007.5208."""
+ _reference_test(self, ref_line, u"""<record>
+ <datafield ind1="C" ind2="5" tag="999">
+ <subfield code="o">2</subfield>
+ <subfield code="h">E. Dudas, G. von Gersdorff, J. Parmentier and S. Pokorski</subfield>
+ <subfield code="r">arXiv:1007.5208</subfield>
+ </datafield>
+</record>""")
+
class TaskTest(InvenioTestCase):
def setUp(self):
setup_loggers(verbosity=0)
def test_task_run_core(self):
from invenio.refextract_task import task_run_core
task_run_core(1)
TEST_SUITE = make_test_suite(RefextractTest)
if __name__ == '__main__':
run_test_suite(TEST_SUITE, warn_user=True)
diff --git a/modules/docextract/lib/refextract_tag.py b/modules/docextract/lib/refextract_tag.py
index 84647b970..204303672 100644
--- a/modules/docextract/lib/refextract_tag.py
+++ b/modules/docextract/lib/refextract_tag.py
@@ -1,1453 +1,1405 @@
# -*- coding: utf-8 -*-
##
## This file is part of Invenio.
## Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010, 2011 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
import re
+from unidecode import unidecode
+
from invenio.refextract_config import \
CFG_REFEXTRACT_MARKER_CLOSING_AUTHOR_ETAL, \
CFG_REFEXTRACT_MARKER_CLOSING_AUTHOR_INCL, \
CFG_REFEXTRACT_MARKER_CLOSING_AUTHOR_STND, \
- CFG_REFEXTRACT_MARKER_CLOSING_TITLE_IBID
+ CFG_REFEXTRACT_MARKER_CLOSING_TITLE_IBID, \
+ CFG_REFEXTRACT_MARKER_OPENING_TITLE_IBID, \
+ CFG_REFEXTRACT_MARKER_OPENING_COLLABORATION, \
+ CFG_REFEXTRACT_MARKER_CLOSING_COLLABORATION
from invenio.docextract_text import remove_and_record_multiple_spaces_in_line
from invenio.refextract_re import \
re_ibid, \
re_doi, \
re_raw_url, \
- re_matched_ibid, \
re_series_from_numeration, \
- re_series_from_title, \
re_punctuation, \
- re_identify_bf_before_vol, \
re_correct_numeration_2nd_try_ptn1, \
re_correct_numeration_2nd_try_ptn2, \
re_correct_numeration_2nd_try_ptn3, \
re_correct_numeration_2nd_try_ptn4, \
- re_correct_numeration_2nd_try_ptn5, \
re_numeration_nucphys_vol_page_yr, \
re_numeration_vol_subvol_nucphys_yr_page, \
re_numeration_nucphys_vol_yr_page, \
re_multiple_hyphens, \
re_numeration_vol_page_yr, \
re_numeration_vol_yr_page, \
re_numeration_vol_nucphys_series_yr_page, \
re_numeration_vol_series_nucphys_page_yr, \
re_numeration_vol_nucphys_series_page_yr, \
re_html_tagged_url, \
re_numeration_yr_vol_page, \
re_numeration_vol_nucphys_page_yr, \
re_wash_volume_tag, \
re_numeration_vol_nucphys_yr_subvol_page, \
re_quoted, \
re_isbn, \
re_arxiv, \
re_new_arxiv, \
re_pos, \
re_pos_year_num, \
- RE_OLD_ARXIV
+ re_series_from_numeration_after_volume, \
+ RE_OLD_ARXIV, \
+ RE_ARXIV_CATCHUP, \
+ RE_ATLAS_CONF_PRE_2010, \
+ RE_ATLAS_CONF_POST_2010
from invenio.authorextract_re import re_auth, \
- re_extra_auth, \
re_auth_near_miss, \
re_etal, \
etal_matches, \
re_ed_notation
from invenio.docextract_text import wash_line
def tag_reference_line(line, kbs, record_titles_count):
-
- # initialise some variables:
- # dictionaries to record information about, and coordinates of,
- # matched IBID items:
- found_ibids_len = {}
- found_ibids_matchtext = {}
- # dictionaries to record information about, and coordinates of,
- # matched journal title items:
- found_title_len = {}
- found_title_matchtext = {}
- # dictionaries to record information about, and the coordinates of,
- # matched preprint report number items
- found_pprint_repnum_matchlens = {} # lengths of given matches of
- # preprint report numbers
- found_pprint_repnum_replstr = {} # standardised replacement
- # strings for preprint report
- # numbers to be substituted into
- # a line
-
# take a copy of the line as a first working line, clean it of bad
    # accents, and correct punctuation, etc:
working_line1 = wash_line(line)
# Identify volume for POS journal
working_line1 = tag_pos_volume(working_line1)
- # Identify and standardise numeration in the line:
- working_line1 = tag_numeration(working_line1)
-
- # Now that numeration has been marked-up, check for and remove any
- # ocurrences of " bf ":
- working_line1 = re_identify_bf_before_vol.sub(ur" \1", working_line1)
-
# Clean the line once more:
working_line1 = wash_line(working_line1)
# We identify quoted text
# This is useful for books matching
# This is also used by the author tagger to remove quoted
# text which is a sign of a title and not an author
- working_line1 = tag_titles(working_line1)
+ working_line1 = tag_quoted_text(working_line1)
# Identify ISBN (for books)
working_line1 = tag_isbn(working_line1)
# Identify arxiv reports
working_line1 = tag_arxiv(working_line1)
working_line1 = tag_arxiv_more(working_line1)
# Identify volume for POS journal
+ # needs special handling because the volume contains the year
working_line1 = tag_pos_volume(working_line1)
+ # Identify ATL-CONF and ATLAS-CONF report numbers
+ # needs special handling because it has 2 formats depending on the year
+    # and a two-digit year format to convert
+ working_line1 = tag_atlas_conf(working_line1)
# Identify journals with regular expression
# Some journals need to match exact regexps because they can
# conflict with other elements
# e.g. DAN is also a common first name
- working_line1 = tag_journals_re(working_line1, kbs['journals_re'])
+ standardised_titles = kbs['journals'][1]
+ standardised_titles.update(kbs['journals_re'])
+ journals_matches = identifiy_journals_re(working_line1, kbs['journals_re'])
# Remove identified tags
working_line2 = strip_tags(working_line1)
# Transform the line to upper-case, now making a new working line:
working_line2 = working_line2.upper()
# Strip punctuation from the line:
working_line2 = re_punctuation.sub(u' ', working_line2)
# Remove multiple spaces from the line, recording
# information about their coordinates:
removed_spaces, working_line2 = \
remove_and_record_multiple_spaces_in_line(working_line2)
# Identify and record coordinates of institute preprint report numbers:
found_pprint_repnum_matchlens, found_pprint_repnum_replstr, working_line2 =\
identify_report_numbers(working_line2, kbs['report-numbers'])
# Identify and record coordinates of non-standard journal titles:
- found_title_len, found_title_matchtext, working_line2, line_titles_count = \
+ journals_matches_more, working_line2, line_titles_count = \
identify_journals(working_line2, kbs['journals'])
+ journals_matches.update(journals_matches_more)
# Add the count of 'bad titles' found in this line to the total
# for the reference section:
record_titles_count = sum_2_dictionaries(record_titles_count,
line_titles_count)
# Attempt to identify, record and replace any IBIDs in the line:
if (working_line2.upper().find(u"IBID") != -1):
# there is at least one IBID in the line - try to
# identify its meaning:
- found_ibids_len, found_ibids_matchtext, working_line2 = \
+ found_ibids_matchtext, working_line2 = \
identify_ibids(working_line2)
# now update the dictionary of matched title lengths with the
# matched IBID(s) lengths information:
- found_title_len.update(found_ibids_len)
- found_title_matchtext.update(found_ibids_matchtext)
+ journals_matches.update(found_ibids_matchtext)
publishers_matches = identify_publishers(working_line2, kbs['publishers'])
tagged_line = process_reference_line(
working_line=working_line1,
- found_title_len=found_title_len,
- found_title_matchtext=found_title_matchtext,
+ journals_matches=journals_matches,
pprint_repnum_len=found_pprint_repnum_matchlens,
pprint_repnum_matchtext=found_pprint_repnum_replstr,
publishers_matches=publishers_matches,
removed_spaces=removed_spaces,
- standardised_titles=kbs['journals'][1],
- authors_kb=kbs['authors'],
- publishers_kb=kbs['publishers'],
+ standardised_titles=standardised_titles,
+ kbs=kbs,
)
return tagged_line, record_titles_count
def process_reference_line(working_line,
- found_title_len,
- found_title_matchtext,
+ journals_matches,
pprint_repnum_len,
pprint_repnum_matchtext,
publishers_matches,
removed_spaces,
standardised_titles,
- authors_kb,
- publishers_kb):
+ kbs):
"""After the phase of identifying and tagging citation instances
in a reference line, this function is called to go through the
line and the collected information about the recognised citations,
    and to rebuild the line with the recognised citations wrapped in
    standard tags, ready to be grouped later under various MARC XML
    datafields and subfields, depending upon their type.
@param working_line: (string) - this is the line before the
punctuation was stripped. At this stage, it has not been
capitalised, and neither TITLES nor REPORT NUMBERS have been
stripped from it. However, any recognised numeration and/or URLs
have been tagged with <cds.YYYY> tags.
The working_line could, for example, look something like this:
[1] CDS <cds.URL description="http //invenio-software.org/">
http //invenio-software.org/</cds.URL>.
    @param journals_matches: (dictionary) - the text of each journal
    title citation recognised in the line. Keyed by the index within
    the line of each match.
@param pprint_repnum_len: (dictionary) - the lengths of the matched
institutional preprint report number citations found within the line.
Keyed by the index within the line of each match.
@param pprint_repnum_matchtext: (dictionary) - The matched text for each
matched institutional report number. Keyed by the index within the line
of each match.
    @param identified_dois: (list) - the list of DOIs inside the citation
    @param identified_urls: (list) - contains 2-cell tuples, each of which
    represents an identified URL and its description string.
The list takes the order in which the URLs were identified in the line
(i.e. first-found, second-found, etc).
@param removed_spaces: (dictionary) - The number of spaces removed from
the various positions in the line. Keyed by the index of the position
within the line at which the spaces were removed.
@param standardised_titles: (dictionary) - The standardised journal
titles, keyed by the non-standard version of those titles.
    @return: (string) - the rebuilt reference line, in which the
    recognised citations have been wrapped in standard tags.
"""
- if len(found_title_len) + len(pprint_repnum_len) + len(publishers_matches) == 0:
+ if len(journals_matches) + len(pprint_repnum_len) + len(publishers_matches) == 0:
# no TITLE or REPORT-NUMBER citations were found within this line,
# use the raw line: (This 'raw' line could still be tagged with
# recognised URLs or numeration.)
tagged_line = working_line
else:
# TITLE and/or REPORT-NUMBER citations were found in this line,
# build a new version of the working-line in which the standard
# versions of the REPORT-NUMBERs and TITLEs are tagged:
startpos = 0 # First cell of the reference line...
previous_match = {} # previously matched TITLE within line (used
                               # for replacement of IBIDs).
replacement_types = {}
- journals_keys = found_title_matchtext.keys()
+ journals_keys = journals_matches.keys()
journals_keys.sort()
reports_keys = pprint_repnum_matchtext.keys()
reports_keys.sort()
publishers_keys = publishers_matches.keys()
publishers_keys.sort()
spaces_keys = removed_spaces.keys()
spaces_keys.sort()
replacement_types = get_replacement_types(journals_keys,
reports_keys,
publishers_keys)
replacement_locations = replacement_types.keys()
replacement_locations.sort()
tagged_line = u"" # This is to be the new 'working-line'. It will
# contain the tagged TITLEs and REPORT-NUMBERs,
# as well as any previously tagged URLs and
# numeration components.
# begin:
for replacement_index in replacement_locations:
# first, factor in any stripped spaces before this 'replacement'
- (true_replacement_index, extras) = \
+ true_replacement_index, extras = \
account_for_stripped_whitespace(spaces_keys,
removed_spaces,
replacement_types,
pprint_repnum_len,
- found_title_len,
+ journals_matches,
replacement_index)
if replacement_types[replacement_index] == u"journal":
# Add a tagged periodical TITLE into the line:
rebuilt_chunk, startpos, previous_match = \
add_tagged_journal(
reading_line=working_line,
- len_title=found_title_len[replacement_index],
- matched_title=found_title_matchtext[replacement_index],
+ journal_info=journals_matches[replacement_index],
previous_match=previous_match,
startpos=startpos,
true_replacement_index=true_replacement_index,
extras=extras,
standardised_titles=standardised_titles)
tagged_line += rebuilt_chunk
elif replacement_types[replacement_index] == u"reportnumber":
# Add a tagged institutional preprint REPORT-NUMBER
# into the line:
rebuilt_chunk, startpos = \
add_tagged_report_number(
reading_line=working_line,
len_reportnum=pprint_repnum_len[replacement_index],
reportnum=pprint_repnum_matchtext[replacement_index],
startpos=startpos,
true_replacement_index=true_replacement_index,
- extras=extras
- )
+ extras=extras)
tagged_line += rebuilt_chunk
elif replacement_types[replacement_index] == u"publisher":
rebuilt_chunk, startpos = \
add_tagged_publisher(
reading_line=working_line,
matched_publisher=publishers_matches[replacement_index],
startpos=startpos,
true_replacement_index=true_replacement_index,
extras=extras,
- kb_publishers=publishers_kb
- )
+ kb_publishers=kbs['publishers'])
tagged_line += rebuilt_chunk
# add the remainder of the original working-line into the rebuilt line:
tagged_line += working_line[startpos:]
- # use the recently marked-up title information to identify any
- # numeration that escaped the last pass:
- tagged_line = re_identify_numeration(tagged_line)
# we have all the numeration
# we can make sure there's no space between the volume
# letter and the volume number
# e.g. B 20 -> B20
tagged_line = wash_volume_tag(tagged_line)
- # Before moving onto creating the XML string...
- # try to find any authors in the line
- # Found authors are immediately placed into tags
- # (after Titles and Repnum's have been found)
- tagged_line = identify_and_tag_authors(tagged_line, authors_kb)
+ # Try to find any authors in the line
+ tagged_line = identify_and_tag_authors(tagged_line, kbs['authors'])
+ # Try to find any collaboration in the line
+ tagged_line = identify_and_tag_collaborations(tagged_line,
+ kbs['collaborations'])
return tagged_line.replace('\n', '')
def wash_volume_tag(line):
return re_wash_volume_tag[0].sub(re_wash_volume_tag[1], line)
def tag_isbn(line):
"""Tag books ISBN"""
return re_isbn.sub(ur'<cds.ISBN>\g<code></cds.ISBN>', line)
-def tag_titles(line):
+def tag_quoted_text(line):
"""Tag quoted titles
We use titles for pretty display of references that we could not
    associate with a record.
We also use titles for recognising books.
"""
return re_quoted.sub(ur'<cds.QUOTED>\g<title></cds.QUOTED>', line)
def tag_arxiv(line):
"""Tag arxiv report numbers
We handle arXiv in 2 ways:
    * identifiers starting with arXiv:, e.g. arXiv:1022.1111
    * bare numbers matching exactly the format 9999.9999
    We also normalise the output to the standard arxiv notation, e.g.:
    * arXiv:0712.1111
    * arXiv:0712.1111v2
"""
def tagger(match):
groups = match.groupdict()
if match.group('suffix'):
groups['suffix'] = ' ' + groups['suffix']
else:
groups['suffix'] = ''
return u'<cds.REPORTNUMBER>arXiv:%(year)s'\
u'%(month)s.%(num)s%(suffix)s' \
u'</cds.REPORTNUMBER>' % groups
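    # Apply both recognised arXiv spellings to the line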
line = re_arxiv.sub(tagger, line)
line = re_new_arxiv.sub(tagger, line)
return line
def tag_arxiv_more(line):
- """Tag old arxiv report numbers"""
+ """Tag old arxiv report numbers
+
+    Handles either of these formats:
+ * hep-th/1234567
+ * arXiv:1022111 [hep-ph] which transforms to hep-ph/1022111
+ """
+ line = RE_ARXIV_CATCHUP.sub(ur"\g<suffix>/\g<year>\g<month>\g<num>", line)
+
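+    # Old-style identifiers regain the standard "/" separator,
+    # e.g. "hep-ph-9711200" -> "hep-ph/9711200"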
for report_re, report_repl in RE_OLD_ARXIV:
- report_number = report_repl + ur"\g<num>"
+ report_number = report_repl + ur"/\g<num>"
line = report_re.sub(u'<cds.REPORTNUMBER>' + report_number \
+ u'</cds.REPORTNUMBER>',
line)
return line
def tag_pos_volume(line):
"""Tag POS volume number
    PoS is a journal that has special volume numbers,
e.g. PoS LAT2007 (2007) 369
"""
def tagger(match):
groups = match.groupdict()
try:
year = match.group('year')
except IndexError:
# Extract year from volume name
# which should always include the year
- g = re.search(re_pos_year_num, match.group('volume'), re.UNICODE)
+ g = re.search(re_pos_year_num, match.group('volume_num'), re.UNICODE)
year = g.group(0)
if year:
groups['year'] = ' <cds.YR>(%s)</cds.YR>' % year.strip().strip('()')
else:
groups['year'] = ''
return '<cds.JOURNAL>PoS</cds.JOURNAL>' \
- ' <cds.VOL>%(volume)s</cds.VOL>' \
+ ' <cds.VOL>%(volume_name)s%(volume_num)s</cds.VOL>' \
'%(year)s' \
' <cds.PG>%(page)s</cds.PG>' % groups
for p in re_pos:
line = p.sub(tagger, line)
return line
-def tag_journals_re(line, kb_journals):
- for pattern, journal in kb_journals:
- line = pattern.sub(journal, line)
+def tag_atlas_conf(line):
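+    # Two report-number eras (see the tests above): pre-2010 numbers are
+    # normalised to ATL-CONF-, 2010 onwards to ATLAS-CONF-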
+ line = RE_ATLAS_CONF_PRE_2010.sub(
+ ur'<cds.REPORTNUMBER>ATL-CONF-\g<code></cds.REPORTNUMBER>', line)
+ line = RE_ATLAS_CONF_POST_2010.sub(
+ ur'<cds.REPORTNUMBER>ATLAS-CONF-\g<code></cds.REPORTNUMBER>', line)
return line
-def re_identify_numeration(line):
+def identifiy_journals_re(line, kb_journals):
+ matches = {}
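+    # Keyed by the position of each match in the line, like the other matchers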
+ for pattern, dummy in kb_journals:
+ match = re.search(pattern, line)
+ if match:
+ matches[match.start()] = match.group(0)
+ return matches
+
+
+def find_numeration_more(line):
"""Look for other numeration in line."""
# First, attempt to use marked-up titles
patterns = (
re_correct_numeration_2nd_try_ptn1,
re_correct_numeration_2nd_try_ptn2,
re_correct_numeration_2nd_try_ptn3,
re_correct_numeration_2nd_try_ptn4,
- re_correct_numeration_2nd_try_ptn5,
)
- for pattern, replacement in patterns:
- line = pattern.sub(replacement, line)
- return line
+ for pattern in patterns:
+ match = pattern.search(line)
+ if match:
+ info = match.groupdict()
+ series = extract_series_from_volume(info['vol'])
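+            # The volume number may sit in one of several alternative capture
+            # groups; fall through until one is set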
+ if not info['vol_num']:
+ info['vol_num'] = info['vol_num_alt']
+ if not info['vol_num']:
+ info['vol_num'] = info['vol_num_alt2']
+ return {'year': info.get('year', None),
+ 'series': series,
+ 'volume': info['vol_num'],
+ 'page': info['page'],
+ 'len': len(info['aftertitle'])}
+
+ return None
def add_tagged_report_number(reading_line,
len_reportnum,
reportnum,
startpos,
true_replacement_index,
extras):
"""In rebuilding the line, add an identified institutional REPORT-NUMBER
(standardised and tagged) into the line.
@param reading_line: (string) The reference line before capitalization
was performed, and before REPORT-NUMBERs and TITLEs were stipped out.
@param len_reportnum: (integer) the length of the matched REPORT-NUMBER.
@param reportnum: (string) the replacement text for the matched
REPORT-NUMBER.
@param startpos: (integer) the pointer to the next position in the
reading-line from which to start rebuilding.
@param true_replacement_index: (integer) the replacement index of the
matched REPORT-NUMBER in the reading-line, with stripped punctuation
and whitespace accounted for.
@param extras: (integer) extras to be added into the replacement index.
@return: (tuple) containing a string (the rebuilt line segment) and an
integer (the next 'startpos' in the reading-line).
"""
rebuilt_line = u"" # The segment of the line that's being rebuilt to
# include the tagged & standardised REPORT-NUMBER
# Fill rebuilt_line with the contents of the reading_line up to the point
# of the institutional REPORT-NUMBER. However, stop 1 character before the
# replacement index of this REPORT-NUMBER to allow for removal of braces,
# if necessary:
if (true_replacement_index - startpos - 1) >= 0:
rebuilt_line += reading_line[startpos:true_replacement_index - 1]
else:
rebuilt_line += reading_line[startpos:true_replacement_index]
# check to see whether the REPORT-NUMBER was enclosed within brackets;
# drop them if so:
if reading_line[true_replacement_index - 1] not in (u"[", u"("):
# no braces enclosing the REPORT-NUMBER:
rebuilt_line += reading_line[true_replacement_index - 1]
# Add the tagged REPORT-NUMBER into the rebuilt-line segment:
rebuilt_line += u"<cds.REPORTNUMBER>%(reportnum)s</cds.REPORTNUMBER>" \
% {'reportnum' : reportnum}
# Move the pointer in the reading-line past the current match:
startpos = true_replacement_index + len_reportnum + extras
# Move past closing brace for report number (if there was one):
try:
if reading_line[startpos] in (u"]", u")"):
startpos += 1
except IndexError:
# moved past end of line - ignore
pass
# return the rebuilt-line segment and the pointer to the next position in
# the reading-line from which to start rebuilding up to the next match:
return rebuilt_line, startpos
-def add_tagged_journal_in_place_of_IBID(previous_match,
- ibid_series):
+def add_tagged_journal_in_place_of_IBID(previous_match):
"""In rebuilding the line, if the matched TITLE was actually an IBID, this
function will replace it with the previously matched TITLE, and add it
    into the line, tagged. The series letter is handled separately by
    the caller, using the series recorded in the previous match.
    @param previous_match: (dict) - the previously matched TITLE.
    @return: (string) the rebuilt line segment.
"""
- IBID_start_tag = " <cds.JOURNALibid>"
-
- rebuilt_line = u""
- if ibid_series != "":
- # This IBID has a series letter. If the previously matched TITLE also
- # had a series letter and that series letter differs to the one
- # carried by this IBID, the series letter stored in the previous-match
- # must be updated to that of this IBID
- # (i.e. Keep the series letter paired directly with the IBID):
- if previous_match['series'] is not None:
- # Previous match had a series:
- if previous_match['series'] != ibid_series:
- # Previous match and this IBID do not have the same series
- previous_match['series'] = ibid_series
-
- rebuilt_line += IBID_start_tag + "%(previous-match)s" \
- % {'previous-match' : previous_match['title']} + \
- CFG_REFEXTRACT_MARKER_CLOSING_TITLE_IBID + \
- " : " + previous_match['series']
- else:
- # Previous match had no recognised series but the IBID did. Add a
- # the series letter to the end of the previous match.
- previous_match['series'] = ibid_series
- rebuilt_line += IBID_start_tag + "%(previous-match)s" \
- % {'previous-match' : previous_match['title']} + \
- CFG_REFEXTRACT_MARKER_CLOSING_TITLE_IBID + \
- " : " + previous_match['series']
+ return " %s%s%s" % (CFG_REFEXTRACT_MARKER_OPENING_TITLE_IBID,
+ previous_match['title'],
+ CFG_REFEXTRACT_MARKER_CLOSING_TITLE_IBID)
- else:
- if previous_match['series'] is not None:
- # Both the previous match & this IBID have the same series
- rebuilt_line += IBID_start_tag + "%(previous-match)s" \
- % {'previous-match' : previous_match['title']} + \
- CFG_REFEXTRACT_MARKER_CLOSING_TITLE_IBID + \
- " : " + previous_match['series']
- else:
- # This IBID has no series letter.
- # If a previous series is present append it.
- rebuilt_line += IBID_start_tag + "%(previous-match)s" \
- % {'previous-match' : previous_match['title']} + \
- CFG_REFEXTRACT_MARKER_CLOSING_TITLE_IBID
- return rebuilt_line, previous_match
+def extract_series_from_volume(volume):
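+    # e.g. a volume written "59 B" yields the series letter "B";
+    # returns None when no series letter is found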
+ patterns = (re_series_from_numeration,
+ re_series_from_numeration_after_volume)
+ for p in patterns:
+ match = p.search(volume)
+ if match:
+ return match.group(1)
+ return None
+
+
+def create_numeration_tag(info):
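+    # e.g. {'series': 'B', 'volume': '520', 'year': '2001', 'page': '243'}
+    # gives " <cds.VOL>B520</cds.VOL> <cds.YR>(2001)</cds.YR> <cds.PG>243</cds.PG>"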
+ if info['series']:
+ series_and_volume = info['series'] + info['volume']
+ else:
+ series_and_volume = info['volume']
+ numeration_tags = u' <cds.VOL>%s</cds.VOL>' % series_and_volume
+ if info.get('year', False):
+ numeration_tags += u' <cds.YR>(%(year)s)</cds.YR>' % info
+ numeration_tags += u' <cds.PG>%(page)s</cds.PG>' % info
+ return numeration_tags
def add_tagged_journal(reading_line,
- len_title,
- matched_title,
+ journal_info,
previous_match,
startpos,
true_replacement_index,
extras,
standardised_titles):
"""In rebuilding the line, add an identified periodical TITLE (standardised
and tagged) into the line.
@param reading_line: (string) The reference line before capitalization
was performed, and before REPORT-NUMBERs and TITLEs were stripped out.
    @param journal_info: (string) the matched journal TITLE text (or IBID).
- @param previous_match: (string) the previous periodical TITLE citation to
+ @param previous_match: (dict) the previous periodical TITLE citation to
have been matched in the current reference line. It is used when
replacing an IBID instance in the line.
@param startpos: (integer) the pointer to the next position in the
reading-line from which to start rebuilding.
@param true_replacement_index: (integer) the replacement index of the
matched TITLE in the reading-line, with stripped punctuation and
whitespace accounted for.
@param extras: (integer) extras to be added into the replacement index.
@param standardised_titles: (dictionary) the standardised versions of
periodical titles, keyed by their various non-standard versions.
@return: (tuple) containing a string (the rebuilt line segment), an
    integer (the next 'startpos' in the reading-line), and a dict
    (the newly updated previous-match).
"""
+ old_startpos = startpos
+ old_previous_match = previous_match
+ skip_numeration = False
+ series = None
+
+ def skip_ponctuation(line, pos):
+ # Skip past any punctuation at the end of the replacement that was
+ # just made:
+ try:
+ while line[pos] in (".", ":", "-", ")"):
+ pos += 1
+ except IndexError:
+ # The match was at the very end of the line
+ pass
+
+ return pos
+
# Fill 'rebuilt_line' (the segment of the line that is being rebuilt to
# include the tagged and standardised periodical TITLE) with the contents
# of the reading-line, up to the point of the matched TITLE:
rebuilt_line = reading_line[startpos:true_replacement_index]
+
# Test to see whether a title or an "IBID" was matched:
- if matched_title.upper().find("IBID") != -1:
+ if journal_info.upper().find("IBID") != -1:
# This is an IBID
# Try to replace the IBID with a title:
- if len(previous_match) > 1:
- # A title has already been replaced in this line - IBID can be
- # replaced meaninfully First, try to get the series number/letter
- # of this IBID:
- m_ibid = re_matched_ibid.search(matched_title)
- try:
- series = m_ibid.group(1)
- except IndexError:
- series = u""
- if series is None:
- series = u""
+ if previous_match:
# Replace this IBID with the previous title match, if possible:
- (replaced_ibid_segment, previous_match) = \
- add_tagged_journal_in_place_of_IBID(previous_match, series)
- rebuilt_line += replaced_ibid_segment
+ rebuilt_line += add_tagged_journal_in_place_of_IBID(previous_match)
+ series = previous_match['series']
# Update start position for next segment of original line:
- startpos = true_replacement_index + len_title + extras
-
- # Skip past any punctuation at the end of the replacement that was
- # just made:
- try:
- if reading_line[startpos] in (".", ":", ";", "-"):
- startpos += 1
- except IndexError:
- # The match was at the very end of the line
- pass
+ startpos = true_replacement_index + len(journal_info) + extras
+ startpos = skip_ponctuation(reading_line, startpos)
else:
- # no previous title-replacements in this line - IBID refers to
- # something unknown and cannot be replaced:
- rebuilt_line += \
- reading_line[true_replacement_index:true_replacement_index \
- + len_title + extras]
- startpos = true_replacement_index + len_title + extras
+ rebuilt_line = ""
+ skip_numeration = True
else:
+ if ';' in standardised_titles[journal_info]:
+ title, series = \
+ standardised_titles[journal_info].rsplit(';', 1)
+ series = series.strip()
+ previous_match = {'title': title,
+ 'series': series}
+ else:
+ title = standardised_titles[journal_info]
+ previous_match = {'title': title,
+ 'series': None}
+
# This is a normal title, not an IBID
- rebuilt_line += "<cds.JOURNAL>%(title)s</cds.JOURNAL>" \
- % {'title' : standardised_titles[matched_title]}
- previous_title = standardised_titles[matched_title]
- startpos = true_replacement_index + len_title + extras
-
- # Try to get the series of this just added title, by dynamically finding it
- # from the text after the title tag (rather than from the kb)
- # Ideally, the series will be found with the numeration after the title
- # but will also check the ending of the title text if this match fails.
- previous_series_from_numeration = re_series_from_numeration.search(reading_line[startpos:])
- previous_series_from_title = re_series_from_title.search(previous_title)
- # First, try to obtain the series from the identified numeration
- if previous_series_from_numeration:
- previous_match = {
- 'title' : previous_title,
- 'series': previous_series_from_numeration.group(1),
- }
- # If it isn't found, try to get it from the standardised title
- # BUT ONLY if the numeration matched!
- elif previous_series_from_title and previous_series_from_title.group(3):
- previous_match = {
- 'title' : previous_series_from_title.group(1),
- 'series': previous_series_from_title.group(3),
- }
+ rebuilt_line += "<cds.JOURNAL>%s</cds.JOURNAL>" % title
+ startpos = true_replacement_index + len(journal_info) + extras
+ startpos = skip_ponctuation(reading_line, startpos)
+
+ if not skip_numeration:
+ # Check for numeration
+ numeration_line = reading_line[startpos:]
+ # First look for standard numeration
+ numerotation_info = find_numeration(numeration_line)
+ if not numerotation_info:
+ numeration_line = rebuilt_line + " " + numeration_line
+ # Now look for more funky numeration
+ # With possibly some elements before the journal title
+ numerotation_info = find_numeration_more(numeration_line)
+
+ if not numerotation_info:
+ startpos = old_startpos
+ previous_match = old_previous_match
+ rebuilt_line = ""
else:
- previous_match = {'title': previous_title, 'series': None}
+ if series and not numerotation_info['series']:
+ numerotation_info['series'] = series
+ startpos += numerotation_info['len']
+ rebuilt_line += create_numeration_tag(numerotation_info)
- # Skip past any punctuation at the end of the replacement that was
- # just made:
- try:
- if reading_line[startpos] in (".", ":", ";", "-"):
- startpos += 1
- except IndexError:
- # The match was at the very end of the line
- pass
- try:
- if reading_line[startpos] == ")":
- startpos += 1
- except IndexError:
- # The match was at the very end of the line
- pass
+ previous_match['series'] = numerotation_info['series']
# return the rebuilt line-segment, the position (of the reading line) from
# which the next part of the rebuilt line should be started, and the newly
# updated previous match.
-
return rebuilt_line, startpos, previous_match
def add_tagged_publisher(reading_line,
matched_publisher,
startpos,
true_replacement_index,
extras,
kb_publishers):
"""In rebuilding the line, add an identified periodical TITLE (standardised
and tagged) into the line.
@param reading_line: (string) The reference line before capitalization
was performed, and before REPORT-NUMBERs and TITLEs were stripped out.
    @param matched_publisher: (string) the matched PUBLISHER text.
@param startpos: (integer) the pointer to the next position in the
reading-line from which to start rebuilding.
@param true_replacement_index: (integer) the replacement index of the
matched TITLE in the reading-line, with stripped punctuation and
whitespace accounted for.
@param extras: (integer) extras to be added into the replacement index.
    @param kb_publishers: (dictionary) the publishers knowledge base;
    each entry holds the standard replacement text under 'repl'.
    @return: (tuple) containing a string (the rebuilt line segment) and an
    integer (the next 'startpos' in the reading-line).
"""
# Fill 'rebuilt_line' (the segment of the line that is being rebuilt to
# include the tagged and standardised periodical TITLE) with the contents
# of the reading-line, up to the point of the matched TITLE:
rebuilt_line = reading_line[startpos:true_replacement_index]
# This is a normal title, not an IBID
rebuilt_line += "<cds.PUBLISHER>%(title)s</cds.PUBLISHER>" \
- % {'title' : kb_publishers[matched_publisher]}
+ % {'title' : kb_publishers[matched_publisher]['repl']}
# Compute new start pos
startpos = true_replacement_index + len(matched_publisher) + extras
# return the rebuilt line-segment, the position (of the reading line) from
# which the next part of the rebuilt line should be started, and the newly
# updated previous match.
return rebuilt_line, startpos
def get_replacement_types(titles, reportnumbers, publishers):
"""Given the indices of the titles and reportnumbers that have been
recognised within a reference line, create a dictionary keyed by
the replacement position in the line, where the value for each
key is a string describing the type of item replaced at that
position in the line.
The description strings are:
'title' - indicating that the replacement is a
periodical title
'reportnumber' - indicating that the replacement is a
preprint report number.
@param titles: (list) of locations in the string at which
periodical titles were found.
@param reportnumbers: (list) of locations in the string at which
reportnumbers were found.
@return: (dictionary) of replacement types at various locations
within the string.
"""
rep_types = {}
for item_idx in titles:
rep_types[item_idx] = "journal"
for item_idx in reportnumbers:
rep_types[item_idx] = "reportnumber"
for item_idx in publishers:
rep_types[item_idx] = "publisher"
return rep_types
def account_for_stripped_whitespace(spaces_keys,
removed_spaces,
replacement_types,
len_reportnums,
- len_titles,
+ journals_matches,
replacement_index):
"""To build a processed (MARC XML) reference line in which the
recognised citations such as standardised periodical TITLEs and
REPORT-NUMBERs have been marked up, it is necessary to read from
the reference line BEFORE all punctuation was stripped and it was
made into upper-case. The indices of the cited items in this
'original line', however, will be different to those in the
'working-line', in which punctuation and multiple-spaces were
stripped out. For example, the following reading-line:
[26] E. Witten and S.-T. Yau, hep-th/9910245.
...becomes (after punctuation and multiple white-space stripping):
[26] E WITTEN AND S T YAU HEP TH/9910245
It can be seen that the report-number citation (hep-th/9910245) is
at a different index in the two strings. When refextract searches
for this citation, it uses the 2nd string (i.e. that which is
capitalised and has no punctuation). When it builds the MARC XML
representation of the reference line, however, it needs to read from
the first string. It must therefore consider the whitespace,
punctuation, etc that has been removed, in order to get the correct
index for the cited item. This function accounts for the stripped
characters before a given TITLE or REPORT-NUMBER index.
@param spaces_keys: (list) - the indices at which spaces were
removed from the reference line.
@param removed_spaces: (dictionary) - keyed by the indices at which
spaces were removed from the line, the values are the number of
spaces actually removed from that position.
So, for example, "3 spaces were removed from position 25 in
the line."
@param replacement_types: (dictionary) - at each 'replacement_index'
    in the line, the type of replacement to make (title or reportnumber).
@param len_reportnums: (dictionary) - the lengths of the REPORT-
NUMBERs matched at the various indices in the line.
    @param journals_matches: (dictionary) - the text of the various
    TITLEs matched at the various indices in the line.
@param replacement_index: (integer) - the index in the working line
of the identified TITLE or REPORT-NUMBER citation.
@return: (tuple) containing 2 elements:
+ the true replacement index of a replacement in
the reading line;
+ any extras to add into the replacement index;
"""
extras = 0
true_replacement_index = replacement_index
spare_replacement_index = replacement_index
for space in spaces_keys:
if space < true_replacement_index:
# There were spaces stripped before the current replacement
# Add the number of spaces removed from this location to the
# current replacement index:
- true_replacement_index += removed_spaces[space]
+ true_replacement_index += removed_spaces[space]
spare_replacement_index += removed_spaces[space]
- elif (space >= spare_replacement_index) and \
- (replacement_types[replacement_index] == u"journal") and \
- (space < (spare_replacement_index + \
- len_titles[replacement_index])):
+ elif space >= spare_replacement_index and \
+ replacement_types[replacement_index] == u"journal" and \
+ space < (spare_replacement_index + \
+ len(journals_matches[replacement_index])):
# A periodical title is being replaced. Account for multi-spaces
# that may have been stripped from the title before its
# recognition:
spare_replacement_index += removed_spaces[space]
extras += removed_spaces[space]
- elif (space >= spare_replacement_index) and \
- (replacement_types[replacement_index] == u"reportnumber") and \
- (space < (spare_replacement_index + \
- len_reportnums[replacement_index])):
+ elif space >= spare_replacement_index and \
+ replacement_types[replacement_index] == u"reportnumber" and \
+ space < (spare_replacement_index + \
+ len_reportnums[replacement_index]):
# An institutional preprint report-number is being replaced.
# Account for multi-spaces that may have been stripped from it
# before its recognition:
spare_replacement_index += removed_spaces[space]
extras += removed_spaces[space]
# return the new values for replacement indices with stripped
# whitespace accounted for:
return true_replacement_index, extras
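# A hedged worked example of the arithmetic above (all values are
# hypothetical): if 2 spaces were stripped at position 10, then a
# report number matched at working-line index 30 really starts at
# index 32 in the reading line:
# >>> account_for_stripped_whitespace([10], {10: 2},
# ... {30: u"reportnumber"}, {30: 5}, {}, 30)
# (32, 0)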
def strip_tags(line):
# Firstly, go through and change ALL TAGS and their contents to underscores
# author content can be checked for underscores later on
# Note that tags are never embedded in one another, which is why
# this simple substitution is safe
re_tag = re.compile(ur'<cds\.[A-Z]+>[^<]*</cds\.[A-Z]+>|<cds\.[A-Z]+ />',
re.UNICODE)
for m in re_tag.finditer(line):
chars_count = m.end() - m.start()
line = re_tag.sub('_'*chars_count, line, count=1)
return line
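# Example: each tagged span becomes a run of underscores of equal
# length, so character indices within the line are preserved:
# >>> strip_tags(u'a <cds.DOI /> b')
# u'a ___________ b'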
+def identify_and_tag_collaborations(line, collaborations_kb):
+ """Given a line where Authors have been tagged, and all other tags
+ and content has been replaced with underscores, go through and try
+ to identify extra items of data which should be placed into 'h'
+ subfields.
+ Later on, these tagged pieces of information will be merged into
+ the content of the most recently found author. This is separated
+ from the author tagging procedure since separate tags can be used,
+ which won't influence the reference splitting heuristics
+ (used when looking at multiple <AUTH> tags in a line).
+ """
+ for dummy, re_collab in collaborations_kb.iteritems():
+ matches = re_collab.finditer(strip_tags(line))
+
+ for match in reversed(list(matches)):
+ line = line[:match.start()] \
+ + CFG_REFEXTRACT_MARKER_OPENING_COLLABORATION \
+ + match.group(1).strip(".,:;- [](){}") \
+ + CFG_REFEXTRACT_MARKER_CLOSING_COLLABORATION \
+ + line[match.end():]
+
+ return line
+
+
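+# A hedged sketch of the expected kb shape (this toy kb is an
+# assumption; real entries come from the collaborations knowledge
+# base, and each regex's group(1) must capture the collaboration name):
+# kb = {'atlas': re.compile(ur'((?:THE )?ATLAS COLLABORATION)')}
+# identify_and_tag_collaborations(u'THE ATLAS COLLABORATION, 2008', kb)
+# wraps the captured name between the opening and closing
+# collaboration marker constants imported from refextract_config.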
def identify_and_tag_authors(line, authors_kb):
"""Given a reference, look for a group of author names,
place tags around the author group, return the newly tagged line.
"""
- def identify_and_tag_extra_authors(line):
- """Given a line where Authors have been tagged, and all other tags
- and content has been replaced with underscores, go through and try
- to identify extra items of data which should be placed into 'h'
- subfields.
- Later on, these tagged pieces of information will be merged into
- the content of the most recently found author. This is separated
- from the author tagging procedure since separate tags can be used,
- which won't influence the reference splitting heuristics
- (used when looking at mulitple <AUTH> tags in a line).
- """
- if re_extra_auth:
- extra_authors = re_extra_auth.finditer(line)
- positions = []
- for match in extra_authors:
- positions.append({'start' : match.start(),
- 'end' : match.end(),
- 'author' : match.group('extra_auth')})
- positions.reverse()
- for p in positions:
- line = line[:p['start']] \
- + "<cds.AUTHincl>" \
- + p['author'].strip(".,:;- []") \
- + CFG_REFEXTRACT_MARKER_CLOSING_AUTHOR_INCL \
- + line[p['end']:]
- return line
# Replace authors which do not convert well from utf-8
for pattern, repl in authors_kb:
line = line.replace(pattern, repl)
output_line = line
- line = strip_tags(line)
+ line = strip_tags(unidecode(line))
+ if len(line) != len(output_line):
+ output_line = unidecode(output_line)
+ line = strip_tags(output_line)
# Find as many author groups (collections of author names) as possible from the 'title-hidden' line
matched_authors = re_auth.finditer(line)
- # Debug print
- # print 'matching authors on: %r' % line
-
# If there is at least one matched author group
if matched_authors:
matched_positions = []
preceeding_text_string = line
preceeding_text_start = 0
for auth_no, match in enumerate(matched_authors):
- # Debug print author groups
- #print 'authors matches:'
- #print match.groupdict()
-
# Only if there are no underscores or closing arrows found in the matched author group
# This must be checked for here, as it cannot be applied to the re without clashing with
# other Unicode characters
if line[match.start():match.end()].find("_") == -1:
# Has the group with name 'et' (for 'et al') been found in the pattern?
# Has the group with name 'es' (for ed. before the author) been found in the pattern?
# Has the group with name 'ee' (for ed. after the author) been found in the pattern?
matched_positions.append({
'start' : match.start(),
'end' : match.end(),
'etal' : match.group('et') or match.group('et2'),
'ed_start' : match.group('es'),
'ed_end' : match.group('ee'),
'multi_auth' : match.group('multi_auth'),
'multi_surs' : match.group('multi_surs'),
'text_before' : preceeding_text_string[preceeding_text_start:match.start()],
'auth_no' : auth_no,
'author_names': match.group('author_names')
})
# Save the end of the match, from where to snip the misc text found before an author match
preceeding_text_start = match.end()
# Work backwards to avoid index problems when adding AUTH tags
matched_positions.reverse()
for m in matched_positions:
dump_in_misc = False
start = m['start']
end = m['end']
# Check the text before the current match to see if it has a bad 'et al'
lower_text_before = m['text_before'].strip().lower()
for e in etal_matches:
if lower_text_before.endswith(e):
## If so, this author match is likely to be a bad match on a missed title
dump_in_misc = True
break
# An AND found here likely indicates a missed author before this text.
# It therefore triggers weaker author searching within the previous misc text.
# (Check the text before the current match to see if it has a bad 'and')
# A bad 'and' is only denoted as such if there is exactly one author after it
# and the author group is legitimate (not to be dumped in misc)
if not dump_in_misc and not (m['multi_auth'] or m['multi_surs']) \
- and (lower_text_before.endswith(' and')):
+ and (lower_text_before.endswith(' and')):
# Search using a weaker author pattern to try and find the missed author(s) (cut away the end 'and')
weaker_match = re_auth_near_miss.match(m['text_before'])
if weaker_match and not (weaker_match.group('es') or weaker_match.group('ee')):
# Change the start of the author group to include this new author group
start = start - (len(m['text_before']) - weaker_match.start())
# Still no match: do not add tags for this author match; dump it into misc
else:
dump_in_misc = True
add_to_misc = ""
# If a semi-colon was found at the end of this author group, keep it in misc
# so that it can be looked at for splitting heuristics
if len(output_line) > m['end']:
if output_line[m['end']].strip(" ,.") == ';':
add_to_misc = ';'
# Standardize eds. notation
tmp_output_line = re.sub(re_ed_notation, '(ed.)',
output_line[start:end])
# Standardize et al. notation
tmp_output_line = re.sub(re_etal, 'et al.',
tmp_output_line)
# Strip
- tmp_output_line = tmp_output_line.strip(",:;- [](")
+ tmp_output_line = tmp_output_line.lstrip('.').strip(",:;- [](")
+ if not tmp_output_line.endswith('(ed.)'):
+ tmp_output_line = tmp_output_line.strip(')')
# ONLY wrap author data with tags IF there is no evidence that it is an
# ed. author. (i.e. The author is not referred to as an editor)
# Does this author group string have 'et al.'?
if m['etal'] and not (m['ed_start'] or m['ed_end'] or dump_in_misc):
output_line = output_line[:start] \
+ "<cds.AUTHetal>" \
+ tmp_output_line \
+ CFG_REFEXTRACT_MARKER_CLOSING_AUTHOR_ETAL \
+ add_to_misc \
+ output_line[end:]
elif not (m['ed_start'] or m['ed_end'] or dump_in_misc):
# Insert the std (standard) tag
output_line = output_line[:start] \
+ "<cds.AUTHstnd>" \
+ tmp_output_line \
+ CFG_REFEXTRACT_MARKER_CLOSING_AUTHOR_STND \
+ add_to_misc \
+ output_line[end:]
# Apply the 'include in $h' method to author groups marked as editors
elif m['ed_start'] or m['ed_end']:
ed_notation = " (eds.)"
# Standardize et al. notation
tmp_output_line = re.sub(re_etal, 'et al.',
m['author_names'])
# remove any characters which denote this author group
# to be editors, just take the
# author names, and append '(eds.)'
output_line = output_line[:start] \
+ "<cds.AUTHincl>" \
+ tmp_output_line.strip(",:;- [](") \
+ ed_notation \
+ CFG_REFEXTRACT_MARKER_CLOSING_AUTHOR_INCL \
+ add_to_misc \
+ output_line[end:]
- # Now that authors have been tagged, search for the extra information which should be included in $h
- # Tag for this datafield, merge into one $h subfield later on
- output_line = identify_and_tag_extra_authors(output_line)
-
return output_line
def sum_2_dictionaries(dicta, dictb):
"""Given two dictionaries of totals, where each total refers to a key
in the dictionary, add the totals.
E.g.: dicta = { 'a' : 3, 'b' : 1 }
dictb = { 'a' : 1, 'c' : 5 }
dicta + dictb = { 'a' : 4, 'b' : 1, 'c' : 5 }
@param dicta: (dictionary)
@param dictb: (dictionary)
@return: (dictionary) - the sum of the 2 dictionaries
"""
dict_out = dicta.copy()
for key in dictb.keys():
if key in dict_out:
# Add the sum for key in dictb to that of dict_out:
dict_out[key] += dictb[key]
else:
# the key is not in the first dictionary - add it directly:
dict_out[key] = dictb[key]
return dict_out
-def tag_numeration(line):
+def identify_ibids(line):
+ """Find IBIDs within the line, record their position and length,
+ and replace them with underscores.
+ @param line: (string) the working reference line
+ @return: (tuple) containing a dictionary and a string:
+ Dictionary: matched IBID text: (Key: position of IBID in
+ line; Value: matched IBID text)
+ String: working line with matched IBIDs removed
+ """
+ ibid_match_txt = {}
+ # Record details of each matched ibid:
+ for m_ibid in re_ibid.finditer(line):
+ ibid_match_txt[m_ibid.start()] = m_ibid.group(0)
+ # Replace matched text in line with underscores:
+ line = line[0:m_ibid.start()] + \
+ "_" * len(m_ibid.group(0)) + \
+ line[m_ibid.end():]
+
+ return ibid_match_txt, line
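+# e.g., mirroring the unit tests: identify_ibids("") returns ({}, ''),
+# and each matched IBID is recorded at its start position while being
+# blanked out with underscores in the returned line.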
+
+
+def find_all(string, sub):
+ listindex = []
+ offset = 0
+ i = string.find(sub, offset)
+ while i >= 0:
+ listindex.append(i)
+ i = string.find(sub, i + 1)
+ return listindex
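+# e.g. find_all(u'banana', u'an') == [1, 3]; overlapping occurrences
+# are also collected, since the search resumes at i + 1.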
+
+
+def find_numeration(line):
"""Given a reference line, attempt to locate instances of citation
'numeration' in the line.
- Upon finding some numeration, re-arrange it into a standard
- order, and mark it up with tags.
- Will process numeration in the following order:
- Delete the colon and expressions such as Serie, vol, V.
- inside the pattern <serie : volume>
- E.g.: Replace the string 'Series A, Vol 4' with 'A 4'
- Then, the 4 main numeration patterns:
- Pattern 0 (was pattern 3): <x, vol, page, year>
- <v, [FS]?, p, y>
- <[FS]?, v, p, y>
- Pattern 1: <x, vol, year, page>
- <v, [FS]?, y, p>
- <[FS]?, v, y, p>
- Pattern 2: <vol, serie, year, page>
- <v, s, [FS]?, y, p>
- <v, [FS]?, s, y, p
- Pattern 4: <vol, serie, page, year>
- <v, s, [FS]?, p, y>
- <v, [FS]?, s, p, y>
-
@param line: (string) the reference line.
@return: (dictionary) of the numeration found ('year', 'series',
'volume', 'page' and match 'len'), or None if no numeration matched.
"""
patterns = (
- #re_strip_series_and_volume_labels,
-
# vol,page,year
re_numeration_vol_page_yr,
re_numeration_vol_nucphys_page_yr,
re_numeration_nucphys_vol_page_yr,
# With sub volume
re_numeration_vol_subvol_nucphys_yr_page,
re_numeration_vol_nucphys_yr_subvol_page,
# vol,year,page
re_numeration_vol_yr_page,
re_numeration_nucphys_vol_yr_page,
re_numeration_vol_nucphys_series_yr_page,
# vol,page,year
re_numeration_vol_series_nucphys_page_yr,
re_numeration_vol_nucphys_series_page_yr,
# year,vol,page
re_numeration_yr_vol_page,
)
- for pattern, replacement in patterns:
- line = pattern.sub(replacement, line)
-
- return line
-
-
-def identify_ibids(line):
- """Find IBIDs within the line, record their position and length,
- and replace them with underscores.
- @param line: (string) the working reference line
- @return: (tuple) containing 2 dictionaries and a string:
- Dictionary 1: matched IBID lengths (Key: position of IBID
- in line; Value: length of matched IBID)
- Dictionary 2: matched IBID text: (Key: position of IBID in
- line; Value: matched IBID text)
- String: working line with matched IBIDs removed
- """
- ibid_match_len = {}
- ibid_match_txt = {}
- ibid_matches_iter = re_ibid.finditer(line)
-
- ## Record details of each matched ibid:
- for m_ibid in ibid_matches_iter:
- ibid_match_len[m_ibid.start()] = len(m_ibid.group(2))
- ibid_match_txt[m_ibid.start()] = m_ibid.group(2)
- ## Replace matched text in line with underscores:
- line = line[0:m_ibid.start(2)] + "_"*len(m_ibid.group(2)) + \
- line[m_ibid.end(2):]
-
- return ibid_match_len, ibid_match_txt, line
-
-
-def find_all(string, sub):
- listindex = []
- offset = 0
- i = string.find(sub, offset)
- while i >= 0:
- listindex.append(i)
- i = string.find(sub, i + 1)
- return listindex
+ for pattern in patterns:
+ match = pattern.match(line)
+ if match:
+ info = match.groupdict()
+ series = info.get('series', None)
+ if not series:
+ series = extract_series_from_volume(info['vol'])
+ if not info['vol_num']:
+ info['vol_num'] = info['vol_num_alt']
+ if not info['vol_num']:
+ info['vol_num'] = info['vol_num_alt2']
+ return {'year': info.get('year', None),
+ 'series': series,
+ 'volume': info['vol_num'],
+ 'page': info['page'],
+ 'len': match.end()}
+
+ return None
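+# e.g., mirroring the unit tests: find_numeration(u"24, 418 (1930)")
+# yields {'volume': u'24', 'year': u'1930', 'page': u'418', ...},
+# while a line with no recognisable numeration yields None.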
def identify_journals(line, kb_journals):
"""Attempt to identify all periodical titles in a reference line.
Titles will be identified, their information (location in line,
length in line, and non-standardised version) will be recorded,
and they will be replaced in the working line by underscores.
@param line: (string) - the working reference line.
@param kb_journals: (tuple) - the journals knowledge base, holding
the regexp patterns used to search for a non-standard TITLE in
the working reference line (keyed by the TITLE string itself)
and the list of non-standard periodical TITLEs to be searched
for. This list of titles has already been ordered and is used
to force the order of searching.
@return: (tuple) containing 3 elements:
+ (dictionary) - the text matched for
each title at each given
index within the line.
+ (string) - the working line, with the
titles removed from it and
replaced by underscores.
+ (dictionary) - the totals for each bad-title
found in the line.
"""
- periodical_title_search_kb = kb_journals[0]
+ periodical_title_search_kb = kb_journals[0]
periodical_title_search_keys = kb_journals[2]
- title_matches_matchlen = {} # info about lengths of periodical titles
- # matched at given locations in the line
- title_matches_matchtext = {} # the text matched at the given line
+ title_matches = {} # the text matched at the given line
# location (i.e. the title itself)
titles_count = {} # sum totals of each 'bad title found in
# line.
# Begin searching:
for title in periodical_title_search_keys:
# search for all instances of the current periodical title
# in the line:
# for each matched periodical title:
for title_match in periodical_title_search_kb[title].finditer(line):
if title not in titles_count:
# Add this title into the titles_count dictionary:
titles_count[title] = 1
else:
# Add 1 to the count for the given title:
titles_count[title] += 1
# record the details of this title match:
# record the matched title text:
- title_matches_matchlen[title_match.start()] = len(title)
+ title_matches[title_match.start()] = title
- # record the matched non-standard version of the title:
- title_matches_matchtext[title_match.start()] = title
+ len_to_replace = len(title)
# replace the matched title text in the line with n * '_',
# where n is the length of the matched title:
line = u"".join((line[:title_match.start()],
- u"_"*len(title),
- line[title_match.start()+len(title):]))
+ u"_" * len_to_replace,
+ line[title_match.start() + len_to_replace:]))
# return recorded information about matched periodical titles,
# along with the newly changed working line:
- return title_matches_matchlen, title_matches_matchtext, line, titles_count
+ return title_matches, line, titles_count
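+# A hedged sketch with a toy kb (real kbs come from the journals
+# knowledge base; element 1 of the tuple is not used here):
+# kb = ({u'PHYS REV': re.compile(u'PHYS REV')}, None, [u'PHYS REV'])
+# identify_journals(u'SEE PHYS REV 12', kb)
+# -> ({4: u'PHYS REV'}, u'SEE ________ 12', {u'PHYS REV': 1})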
def identify_report_numbers(line, kb_reports):
"""Attempt to identify all preprint report numbers in a reference
line.
Report numbers will be identified, their information (location
in line, length in line, and standardised replacement version)
will be recorded, and they will be replaced in the working-line
by underscores.
@param line: (string) - the working reference line.
@param kb_reports: (tuple) - contains the regexp patterns used to
identify preprint report numbers, together with the standardised
'category' of each preprint report number.
@return: (tuple) - 3 elements:
* a dictionary containing the lengths in the line of the
matched preprint report numbers, keyed by the index at
which each match was found in the line.
* a dictionary containing the replacement strings (standardised
versions) of preprint report numbers that were matched in
the line.
* a string, that is the new version of the working reference
line, in which any matched preprint report numbers have been
replaced by underscores.
Returned tuple is therefore in the following order:
(matched-reportnum-lengths, matched-reportnum-replacements,
working-line)
"""
def _by_len(a, b):
"""Comparison function used to sort a list by the length of the
strings in each element of the list.
"""
if len(a[1]) < len(b[1]):
return 1
elif len(a[1]) == len(b[1]):
return 0
else:
return -1
repnum_matches_matchlen = {} # info about lengths of report numbers
# matched at given locations in line
repnum_matches_repl_str = {} # standardised report numbers matched
# at given locations in line
preprint_repnum_search_kb, preprint_repnum_standardised_categs = kb_reports
preprint_repnum_categs = preprint_repnum_standardised_categs.keys()
preprint_repnum_categs.sort(_by_len)
# Handle CERN/LHCC/98-013
line = line.replace('/', ' ')
# try to match preprint report numbers in the line:
for categ in preprint_repnum_categs:
# search for all instances of the current report
# numbering style in the line:
repnum_matches_iter = preprint_repnum_search_kb[categ].finditer(line)
# for each matched report number of this style:
for repnum_match in repnum_matches_iter:
# Get the matched text for the numeration part of the
# preprint report number:
numeration_match = repnum_match.group('numn')
# clean/standardise this numeration text:
numeration_match = numeration_match.replace(" ", "-")
numeration_match = re_multiple_hyphens.sub("-", numeration_match)
numeration_match = numeration_match.replace("/-", "/")
numeration_match = numeration_match.replace("-/", "/")
numeration_match = numeration_match.replace("-/-", "/")
# replace the found preprint report number in the
# string with underscores
# (this will replace chars in the lower-cased line):
line = line[0:repnum_match.start(1)] \
+ "_"*len(repnum_match.group(1)) + line[repnum_match.end(1):]
# record the information about the matched preprint report number:
# total length in the line of the matched preprint report number:
repnum_matches_matchlen[repnum_match.start(1)] = \
len(repnum_match.group(1))
# standardised replacement for the matched preprint report number:
repnum_matches_repl_str[repnum_match.start(1)] = \
preprint_repnum_standardised_categs[categ] \
+ numeration_match
# return recorded information about matched report numbers, along with
# the newly changed working line:
return repnum_matches_matchlen, repnum_matches_repl_str, line
def identify_publishers(line, kb_publishers):
matches_repl = {} # standardised report numbers matched
# at given locations in line
- for abbrev in kb_publishers.keys():
- for match in find_all(line, abbrev):
+ for abbrev, info in kb_publishers.iteritems():
+ for match in info['pattern'].finditer(line):
# record the matched non-standard version of the publisher:
- matches_repl[match] = abbrev
+ matches_repl[match.start(0)] = abbrev
return matches_repl
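# A hedged sketch, assuming each kb value carries a compiled regex
# under 'pattern' (the toy kb below is hypothetical):
# kb = {u'Wiley': {'pattern': re.compile(u'Wiley')}}
# identify_publishers(u'New York: Wiley, 1998', kb) -> {10: u'Wiley'}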
def identify_and_tag_URLs(line):
"""Given a reference line, identify URLs in the line, record the
information about them, and replace them with a "<cds.URL />" tag.
URLs are identified in 2 forms:
+ Raw: http://invenio-software.org/
+ HTML marked-up: <a href="http://invenio-software.org/">CERN Document
Server Software Consortium</a>
These URLs are considered to have 2 components: The URL itself
(url string); and the URL description. The description is effectively
the text used for the created Hyperlink when the URL is marked-up
in HTML. When an HTML marked-up URL has been recognised, the text
between the anchor tags is therefore taken as the URL description.
In the case of a raw URL recognition, however, the URL itself will
also be used as the URL description.
For example, in the following reference line:
[1] See <a href="http://invenio-software.org/">CERN Document Server
Software Consortium</a>.
...the URL string will be "http://invenio-software.org/" and the URL
description will be
"CERN Document Server Software Consortium".
The line returned from this function will be:
[1] See <cds.URL />
In the following line, however:
[1] See http://invenio-software.org/ for more details.
...the URL string will be "http://invenio-software.org/" and the URL
description will also be "http://invenio-software.org/".
The line returned will be:
[1] See <cds.URL /> for more details.
@param line: (string) the reference line in which to search for URLs.
@return: (tuple) - containing 2 items:
+ the line after URLs have been recognised and removed;
+ a list of 2-item tuples where each tuple represents a recognised URL
and its description:
[(url, url-description), (url, url-description), ... ]
@Exceptions raised:
+ an AssertionError if the number of URLs recognised does not
match the number recorded (this should not happen.)
"""
# Take a copy of the line:
line_pre_url_check = line
# Dictionaries to record details of matched URLs:
found_url_full_matchlen = {}
found_url_urlstring = {}
found_url_urldescr = {}
# List to contain details of all matched URLs:
identified_urls = []
# Attempt to identify and tag all HTML-MARKED-UP URLs in the line:
m_tagged_url_iter = re_html_tagged_url.finditer(line)
for m_tagged_url in m_tagged_url_iter:
startposn = m_tagged_url.start() # start position of matched URL
endposn = m_tagged_url.end() # end position of matched URL
matchlen = len(m_tagged_url.group(0)) # total length of URL match
found_url_full_matchlen[startposn] = matchlen
- found_url_urlstring[startposn] = m_tagged_url.group(3)
- found_url_urldescr[startposn] = m_tagged_url.group(15)
+ found_url_urlstring[startposn] = m_tagged_url.group('url')
+ found_url_urldescr[startposn] = m_tagged_url.group('desc')
# temporarily replace the URL match with underscores so that
# it won't be re-found
line = line[0:startposn] + u"_"*matchlen + line[endposn:]
# Attempt to identify and tag all RAW (i.e. not
# HTML-marked-up) URLs in the line:
m_raw_url_iter = re_raw_url.finditer(line)
for m_raw_url in m_raw_url_iter:
startposn = m_raw_url.start() # start position of matched URL
endposn = m_raw_url.end() # end position of matched URL
matchlen = len(m_raw_url.group(0)) # total length of URL match
- matched_url = m_raw_url.group(1)
+ matched_url = m_raw_url.group('url')
if len(matched_url) > 0 and matched_url[-1] in (".", ","):
# Strip the full-stop or comma from the end of the url:
matched_url = matched_url[:-1]
found_url_full_matchlen[startposn] = matchlen
found_url_urlstring[startposn] = matched_url
found_url_urldescr[startposn] = matched_url
# temporarily replace the URL match with underscores
# so that it won't be re-found
line = line[0:startposn] + u"_"*matchlen + line[endposn:]
# Now that all URLs have been identified, insert them
# back into the line, tagged:
found_url_positions = found_url_urlstring.keys()
found_url_positions.sort()
found_url_positions.reverse()
for url_position in found_url_positions:
line = line[0:url_position] + "<cds.URL />" \
+ line[url_position + found_url_full_matchlen[url_position]:]
# The line has been rebuilt. Now record the information about the
# matched URLs:
found_url_positions = found_url_urlstring.keys()
found_url_positions.sort()
for url_position in found_url_positions:
identified_urls.append((found_url_urlstring[url_position], \
found_url_urldescr[url_position]))
# If the number of URLs found doesn't match the number of URLs
# recorded in "identified_urls", the assertion below will fail:
msg = """Error: The number of URLs found in the reference line """ \
"""does not match the number of URLs recorded in the """ \
"""list of identified URLs!\nLine pre-URL checking: %s\n""" \
"""Line post-URL checking: %s\n""" \
% (line_pre_url_check, line)
assert len(identified_urls) == len(found_url_positions), msg
# return the line containing the tagged URLs:
return line, identified_urls
def identify_and_tag_DOI(line):
"""takes a single citation line and attempts to locate any DOI references.
DOI references are recognised in both http (url) format and also the
standard DOI notation (DOI: ...)
@param line: (string) the reference line in which to search for DOI's.
@return: the tagged line and a list of DOI strings (if any)
"""
# Used to hold the DOI strings in the citation line
doi_strings = []
# Run the DOI pattern on the line, returning the re.match objects
matched_doi = re_doi.finditer(line)
# For each match found in the line
- for match in matched_doi:
+ for match in reversed(list(matched_doi)):
# Store the start and end position
start = match.start()
end = match.end()
# Get the actual DOI string (remove the url part of the doi string)
doi_phrase = match.group(6)
# Replace the entire matched doi with a tag
line = line[0:start] + "<cds.DOI />" + line[end:]
# Add the single DOI string to the list of DOI strings
doi_strings.append(doi_phrase)
+ doi_strings.reverse()
return line, doi_strings
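# e.g. a line containing a DOI in standard notation comes back with the
# whole match replaced by '<cds.DOI />' and the bare DOI string added to
# doi_strings; matches are processed right-to-left so that earlier
# match offsets remain valid while the line is being rebuilt.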
diff --git a/modules/docextract/lib/refextract_task.py b/modules/docextract/lib/refextract_task.py
index dfee15ba5..7c3b5141c 100644
--- a/modules/docextract/lib/refextract_task.py
+++ b/modules/docextract/lib/refextract_task.py
@@ -1,239 +1,251 @@
# -*- coding: utf-8 -*-
##
## This file is part of Invenio.
## Copyright (C) 2011 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
"""
Refextract task
Sends references to parse through bibsched
"""
import sys
+from datetime import datetime, timedelta
from invenio.bibtask import task_init, task_set_option, \
task_get_option, write_message
from invenio.config import CFG_VERSION, \
CFG_SITE_SECURE_URL, \
CFG_BIBCATALOG_SYSTEM, \
CFG_REFEXTRACT_TICKET_QUEUE
+from invenio.dbquery import run_sql
from invenio.search_engine import perform_request_search
# Help message is the usage() print out of how to use Refextract
from invenio.refextract_cli import HELP_MESSAGE, DESCRIPTION
from invenio.refextract_api import update_references, \
FullTextNotAvailable, \
RecordHasReferences
-from invenio.docextract_task import task_run_core_wrapper, split_ids
+from invenio.docextract_task import task_run_core_wrapper, \
+ split_ids
+from invenio.docextract_utils import setup_loggers
from invenio.bibcatalog_system_rt import BibCatalogSystemRT
from invenio.bibedit_utils import get_bibrecord
from invenio.bibrecord import record_get_field_instances, \
field_get_subfield_values
def check_options():
""" Reimplement this method for having the possibility to check options
before submitting the task, in order for example to provide default
values. It must return False if there are errors in the options.
"""
if not task_get_option('new') \
and not task_get_option('modified') \
and not task_get_option('recids') \
and not task_get_option('collections') \
and not task_get_option('arxiv'):
print >>sys.stderr, 'Error: No records specified, you need' \
' to specify which records to run on'
return False
return True
def cb_parse_option(key, value, opts, args):
""" Must be defined for bibtask to create a task """
if args and len(args) > 0:
# There should be no standalone arguments for any refextract job
# This will catch args before the job is shipped to Bibsched
raise StandardError("Error: Unrecognised argument '%s'." % args[0])
if key in ('-a', '--new'):
task_set_option('new', True)
task_set_option('no-overwrite', True)
elif key in ('-m', '--modified'):
task_set_option('modified', True)
task_set_option('no-overwrite', True)
elif key in ('-i', '--inspire', ):
task_set_option('inspire', True)
elif key in ('--kb-reports', ):
task_set_option('kb-reports', value)
elif key in ('--kb-journals', ):
task_set_option('kb-journals', value)
elif key in ('--kb-journals-re', ):
task_set_option('kb-journals-re', value)
elif key in ('--kb-authors', ):
task_set_option('kb-authors', value)
elif key in ('--kb-books', ):
task_set_option('kb-books', value)
elif key in ('--kb-conferences', ):
task_set_option('kb-conferences', value)
elif key in ('--create-ticket', ):
task_set_option('create-ticket', True)
elif key in ('--no-overwrite', ):
task_set_option('no-overwrite', True)
elif key in ('--arxiv', ):
task_set_option('arxiv', True)
elif key in ('-c', '--collections'):
collections = task_get_option('collections')
if not collections:
collections = set()
task_set_option('collections', collections)
for v in value.split(","):
collections.update(perform_request_search(c=v))
elif key in ('-r', '--recids'):
recids = task_get_option('recids')
if not recids:
recids = set()
task_set_option('recids', recids)
recids.update(split_ids(value))
return True
def create_ticket(recid, bibcatalog_system, queue=CFG_REFEXTRACT_TICKET_QUEUE):
write_message('bibcatalog_system %s' % bibcatalog_system, verbose=1)
write_message('queue %s' % queue, verbose=1)
if bibcatalog_system and queue:
subject = "Refs for #%s" % recid
# Add the report number to the subject
report_number = ""
record = get_bibrecord(recid)
in_hep = False
for collection_tag in record_get_field_instances(record, "980"):
for collection in field_get_subfield_values(collection_tag, 'a'):
if collection == 'HEP':
in_hep = True
# Only create tickets for HEP
if not in_hep:
write_message("not in hep", verbose=1)
return
+ # Do not create tickets for old records
+ creation_date = run_sql("""SELECT creation_date FROM bibrec
+ WHERE id = %s""", [recid])[0][0]
+ if creation_date < datetime.now() - timedelta(days=365*2):
+ return
+
for report_tag in record_get_field_instances(record, "037"):
for category in field_get_subfield_values(report_tag, 'c'):
if category.startswith('astro-ph'):
write_message("astro-ph", verbose=1)
# We do not curate astro-ph
return
for report_number in field_get_subfield_values(report_tag, 'a'):
subject += " " + report_number
break
text = '%s/record/edit/#state=edit&recid=%s' % (CFG_SITE_SECURE_URL, \
recid)
bibcatalog_system.ticket_submit(subject=subject,
queue=queue,
text=text,
recordid=recid)
def task_run_core(recid, bibcatalog_system=None, _arxiv=False):
+ setup_loggers(None, use_bibtask=True)
+
if _arxiv:
overwrite = True
else:
overwrite = not task_get_option('no-overwrite')
try:
update_references(recid,
overwrite=overwrite)
msg = "Extracted references for %s" % recid
if overwrite:
write_message("%s (overwrite)" % msg)
else:
write_message(msg)
# Create a RT ticket if necessary
if not _arxiv and task_get_option('new') \
or task_get_option('create-ticket'):
write_message("Checking if we should create a ticket", verbose=1)
create_ticket(recid, bibcatalog_system)
except FullTextNotAvailable:
write_message("No full text available for %s" % recid)
except RecordHasReferences:
write_message("Record %s has references, skipping" % recid)
def main():
"""Constructs the refextract bibtask."""
if CFG_BIBCATALOG_SYSTEM == 'RT':
bibcatalog_system = BibCatalogSystemRT()
else:
bibcatalog_system = None
extra_vars = {'bibcatalog_system': bibcatalog_system}
# Build and submit the task
task_init(authorization_action='runrefextract',
authorization_msg="Refextract Task Submission",
description=DESCRIPTION,
# get the global help_message variable imported from refextract.py
help_specific_usage=HELP_MESSAGE + """
Scheduled (daemon) options:
-a, --new Run on all newly inserted records.
-m, --modified Run on all newly modified records.
-r, --recids Record id for extraction.
-c, --collections Entire Collection for extraction.
--arxiv All arxiv modified records within last week
Special (daemon) options:
--create-ticket Create a RT ticket for record references
Examples:
(run a daemon job)
refextract -a
(run on a set of records)
refextract --recids 1,2 -r 3
(run on a collection)
refextract --collections "Reports"
(run as standalone)
refextract -o /home/chayward/refs.xml /home/chayward/thesis.pdf
""",
version="Invenio v%s" % CFG_VERSION,
specific_params=("hVv:x:r:c:nai",
["help",
"version",
"verbose=",
"inspire",
"kb-journals=",
"kb-journals-re=",
"kb-report-numbers=",
"kb-authors=",
"kb-books=",
"recids=",
"collections=",
"new",
"modified",
"no-overwrite",
"arxiv",
"create-ticket"]),
task_submit_elaborate_specific_parameter_fnc=cb_parse_option,
task_submit_check_options_fnc=check_options,
task_run_fnc=task_run_core_wrapper('refextract',
task_run_core,
extra_vars=extra_vars))
diff --git a/modules/docextract/lib/refextract_text.py b/modules/docextract/lib/refextract_text.py
index 8ddea2973..443af7a71 100644
--- a/modules/docextract/lib/refextract_text.py
+++ b/modules/docextract/lib/refextract_text.py
@@ -1,332 +1,333 @@
# -*- coding: utf-8 -*-
##
## This file is part of Invenio.
## Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010, 2011 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
import re
from invenio.docextract_pdf import replace_undesirable_characters
from invenio.docextract_utils import write_message
from invenio.docextract_text import join_lines, \
repair_broken_urls, \
re_multiple_space, \
remove_page_boundary_lines
from invenio.refextract_config import CFG_REFEXTRACT_MAX_LINES
from invenio.refextract_find import find_end_of_reference_section, \
get_reference_section_beginning
def extract_references_from_fulltext(fulltext):
"""Locate and extract the reference section from a fulltext document.
Return the extracted reference section as a list of strings, whereby each
string in the list is considered to be a single reference line.
E.g. a string could be something like:
'[19] Wilson, A. Unpublished (1986).
@param fulltext: (list) of strings, whereby each string is a line of the
document.
@return: (list) of strings, where each string is an extracted reference
line.
"""
# Try to remove pagebreaks, headers, footers
fulltext = remove_page_boundary_lines(fulltext)
status = 0
# Flag recording how the reference section was found
how_found_start = 0
# Find start of refs section
ref_sect_start = get_reference_section_beginning(fulltext)
if ref_sect_start is None:
## No References
refs = []
status = 4
- write_message("* extract_references_from_fulltext: " \
+ write_message("* extract_references_from_fulltext: "
"ref_sect_start is None", verbose=2)
else:
# If a reference section was found, however weak
ref_sect_end = \
find_end_of_reference_section(fulltext,
ref_sect_start["start_line"],
ref_sect_start["marker"],
ref_sect_start["marker_pattern"])
if ref_sect_end is None:
# No End to refs? Not safe to extract
refs = []
status = 5
- write_message("* extract_references_from_fulltext: " \
+ write_message("* extract_references_from_fulltext: "
"no end to refs!", verbose=2)
else:
# If the end of the reference section was found.. start extraction
refs = get_reference_lines(fulltext,
ref_sect_start["start_line"],
ref_sect_end,
ref_sect_start["title_string"],
ref_sect_start["marker_pattern"],
- ref_sect_start["title_marker_same_line"],
- ref_sect_start["marker"])
+ ref_sect_start["title_marker_same_line"])
return refs, status, how_found_start
def get_reference_lines(docbody,
ref_sect_start_line,
ref_sect_end_line,
ref_sect_title,
ref_line_marker_ptn,
- title_marker_same_line,
- ref_line_marker):
+ title_marker_same_line):
"""After the reference section of a document has been identified, and the
first and last lines of the reference section have been recorded, this
function is called to take the reference lines out of the document body.
The document's reference lines are returned in a list of strings whereby
each string is a reference line. Before this can be done however, the
reference section is passed to another function that rebuilds any broken
reference lines.
@param docbody: (list) of strings - the entire document body.
@param ref_sect_start_line: (integer) - the index in docbody of the first
reference line.
@param ref_sect_end_line: (integer) - the index in docbody of the last
reference line.
@param ref_sect_title: (string) - the title of the reference section
(e.g. "References").
@param ref_line_marker_ptn: (string) - the pattern used to match the
marker for each reference line (e.g., could be used to match lines
with markers of the form [1], [2], etc.)
@param title_marker_same_line: (integer) - a flag to indicate whether
or not the reference section title was on the same line as the first
reference line's marker.
@return: (list) of strings. Each string is a reference line, extracted
from the document.
"""
start_idx = ref_sect_start_line
if title_marker_same_line:
# Title on same line as 1st ref- take title out!
title_start = docbody[start_idx].find(ref_sect_title)
if title_start != -1:
# Set the first line with no title
- docbody[start_idx] = docbody[start_idx][title_start + \
+ docbody[start_idx] = docbody[start_idx][title_start +
len(ref_sect_title):]
elif ref_sect_title is not None:
# Set the start of the reference section to be after the title line
start_idx += 1
if ref_sect_end_line is not None:
ref_lines = docbody[start_idx:ref_sect_end_line+1]
else:
ref_lines = docbody[start_idx:]
if ref_sect_title:
ref_lines = strip_footer(ref_lines, ref_sect_title)
- if not ref_line_marker or not ref_line_marker.isdigit():
- ref_lines = strip_pagination(ref_lines)
# Now rebuild reference lines:
# (Go through each raw reference line, and format them into a set
# of properly ordered lines based on markers)
return rebuild_reference_lines(ref_lines, ref_line_marker_ptn)
-def strip_pagination(ref_lines):
+def match_pagination(ref_line):
"""Remove footer pagination from references lines"""
- pattern = ur'\(?\[?\d{0,3}\]?\)?\.?\s*$'
+ pattern = ur'\(?\[?(\d{1,4})\]?\)?\.?\s*$'
re_footer = re.compile(pattern, re.UNICODE)
- return [l for l in ref_lines if not re_footer.match(l)]
+ match = re_footer.match(ref_line)
+ if match:
+ return int(match.group(1))
+ return None
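+# e.g. match_pagination(u'[42]') == 42 and match_pagination(u'(7).') == 7,
+# while match_pagination(u'Phys. Lett. B 42') is None.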
def strip_footer(ref_lines, section_title):
"""Remove footer title from references lines"""
pattern = ur'\(?\[?\d{0,4}\]?\)?\.?\s*%s\s*$' % re.escape(section_title)
re_footer = re.compile(pattern, re.UNICODE)
return [l for l in ref_lines if not re_footer.match(l)]
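# e.g. strip_footer([u'[1] Ref1', u'12 References '], u'References')
# drops the repeated footer line and returns [u'[1] Ref1'].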
def rebuild_reference_lines(ref_sectn, ref_line_marker_ptn):
"""Given a reference section, rebuild the reference lines. After translation
from PDF to text, reference lines are often broken. This is because
pdftotext doesn't know what is a wrapped-line and what is a genuine new
line. As a result, the following 2 reference lines:
[1] See http://invenio-software.org/ for more details.
[2] Example, AN: private communication (1996).
...could be broken into the following 4 lines during translation from PDF
to plaintext:
[1] See http://invenio-software.org/ fo
r more details.
[2] Example, AN: private communica
tion (1996).
Such a situation could lead to a citation being separated across 'lines',
meaning that it wouldn't be correctly recognised.
This function tries to rebuild the reference lines. It uses the pattern
used to recognise a reference line's numeration marker to indicate the
start of a line. If no reference line numeration was recognised, it will
simply join all lines together into one large reference line.
@param ref_sectn: (list) of strings. The (potentially broken) reference
lines.
@param ref_line_marker_ptn: (string) - the pattern used to recognise a
reference line's numeration marker.
@return: (list) of strings - the rebuilt reference section. Each string
in the list represents a complete reference line.
"""
- ## initialise some vars:
- rebuilt_references = []
- working_ref = []
-
- strip_before = True
- if ref_line_marker_ptn is None or \
- type(ref_line_marker_ptn) not in (str, unicode):
+ # This should be moved to the function that detects the pattern!
+ if not ref_line_marker_ptn:
if test_for_blank_lines_separating_reference_lines(ref_sectn):
- ## Use blank lines to separate ref lines
+ # Use blank lines to separate ref lines
ref_line_marker_ptn = ur'^\s*$'
else:
- ## No ref line dividers: unmatchable pattern
- #ref_line_marker_ptn = ur'^A$^A$$'
- # I am adding a new format, hopefully
- # this case wasn't useful
+ # No ref line dividers
+ # We are guessing this is the format:
# Reference1
# etc
# Reference2
# etc
# We split when there's no indentation
ref_line_marker_ptn = ur'^[^\s]'
- strip_before = False
write_message('* references separator %s' % ref_line_marker_ptn, verbose=2)
p_ref_line_marker = re.compile(ref_line_marker_ptn, re.I|re.UNICODE)
- # Work backwards, starting from the last 'broken' reference line
+
+ # Start from ref 1
# Append each fixed reference line to rebuilt_references
- current_ref = None
- line_counter = 0
+ # and rebuild references as we go
+ current_ref = 0
+ rebuilt_references = []
+ working_ref = []
def prepare_ref(working_ref):
+ working_ref = working_ref[:CFG_REFEXTRACT_MAX_LINES]
working_line = ""
- for l in reversed(working_ref):
- working_line = join_lines(working_line, l)
+ for l in working_ref:
+ working_line = join_lines(working_line, l.strip())
working_line = working_line.rstrip()
return wash_and_repair_reference_line(working_line)
- for line in reversed(ref_sectn):
+ for line in ref_sectn:
+ # Can't find a good way to distinguish between
+ # pagination and the page number of a journal numeration that
+ # happens to be alone in a new line
+ # m = match_pagination(line)
+ # if m and current_ref and current_ref != m + 1:
+ # continue
+
# Try to find the marker for the reference line
- if strip_before:
- current_string = line.strip()
- m_ref_line_marker = p_ref_line_marker.search(current_string)
- else:
- m_ref_line_marker = p_ref_line_marker.search(line)
- current_string = line.strip()
-
- if m_ref_line_marker and (not current_ref \
- or current_ref == int(m_ref_line_marker.group('marknum')) + 1):
- # Reference line marker found! : Append this reference to the
- # list of fixed references and reset the working_line to 'blank'
- if current_string != '':
- ## If it's not a blank line to separate refs
- working_ref.append(current_string)
- # Append current working line to the refs list
- if line_counter < CFG_REFEXTRACT_MAX_LINES:
- rebuilt_references.append(prepare_ref(working_ref))
+ m_ref_line_marker = p_ref_line_marker.search(line)
+
+ if m_ref_line_marker:
try:
- current_ref = int(m_ref_line_marker.group('marknum'))
+ marknum = int(m_ref_line_marker.group('marknum'))
except IndexError:
- pass # this line doesn't have numbering
- working_ref = []
- line_counter = 0
- elif current_string != u'':
+ marknum = None
+ if marknum is None or current_ref + 1 == marknum:
+ # Reference line marker found! : Append this reference to the
+ # list of fixed references and reset the working_line to 'blank'
+ start = m_ref_line_marker.start()
+ if line[:start]:
+ # If it's not a blank line to separate refs
+ # Only append from the start of the marker
+ # For this case:
+ # [1] hello
+ # hello2 [2] foo
+ working_ref.append(line[:start])
+
+ # Append current working line to the refs list
+ if working_ref:
+ rebuilt_references.append(prepare_ref(working_ref))
+
+ current_ref = marknum
+ working_ref = []
+ if line[start:]:
+ working_ref.append(line[start:])
+
+ else:
+ # Our marker does not match the counting
+ # Either we missed one, the author missed one or
+ # it is not a line marker
+ # For now we assume it is not line marker
+ working_ref.append(line)
+
+ elif line:
# Continuation of line
- working_ref.append(current_string)
- line_counter += 1
+ working_ref.append(line)
if working_ref:
# Append last line
rebuilt_references.append(prepare_ref(working_ref))
- # A list of reference lines has been built backwards - reverse it:
- rebuilt_references.reverse()
-
- # Make sure mulitple markers within references are correctly
- # in place (compare current marker num with current marker num +1)
- # rebuilt_references = correct_rebuilt_lines(rebuilt_references, \
- # p_ref_line_marker)
-
- # For each properly formated reference line, try to identify cases
- # where there is more than one citation in a single line. This is
- # done by looking for semi-colons, which could be used to
- # separate references
return rebuilt_references
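# Mirroring the unit tests: with the marker pattern
# ur"^\s*(?P<mark>\[\s*(?P<marknum>\d+)\s*\])", the broken lines
# [u"[1] hello", u"hello2", u"[2] foo"] are rebuilt into
# [u"[1] hello hello2", u"[2] foo"].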
def wash_and_repair_reference_line(line):
"""Wash a reference line of undesirable characters (such as poorly-encoded
letters, etc), and repair any errors (such as broken URLs) if possible.
@param line: (string) the reference line to be washed/repaired.
@return: (string) the washed reference line.
"""
# repair URLs in line:
line = repair_broken_urls(line)
# Replace various undesirable characters with their alternatives:
line = replace_undesirable_characters(line)
# Replace "<title>," with "<title>",
# common typing mistake
line = re.sub(ur'"([^"]+),"', ur'"\g<1>",', line)
line = replace_undesirable_characters(line)
# Remove instances of multiple spaces from line, replacing with a
# single space:
line = re_multiple_space.sub(u' ', line)
return line
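# e.g. the typo repair above turns u'A. Author, "A title," Phys. Lett.'
# into u'A. Author, "A title", Phys. Lett.', and any runs of multiple
# spaces are collapsed to single spaces.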
def test_for_blank_lines_separating_reference_lines(ref_sect):
"""Test to see if reference lines are separated by blank lines so that
these can be used to rebuild reference lines.
@param ref_sect: (list) of strings - the reference section.
@return: (int) 0 if blank lines do not separate reference lines; 1 if
they do.
"""
num_blanks = 0 # Number of blank lines found between non-blanks
num_lines = 0 # Number of reference lines separated by blanks
blank_line_separators = 0 # Flag to indicate whether blanks lines separate
# ref lines
multi_nonblanks_found = 0 # Flag to indicate whether multiple nonblank
# lines are found together (used because
# if line is dbl-spaced, it isn't a blank that
# separates refs & can't be relied upon)
x = 0
max_line = len(ref_sect)
while x < max_line:
if not ref_sect[x].isspace():
# not an empty line:
num_lines += 1
x += 1 # Move past line
while x < len(ref_sect) and not ref_sect[x].isspace():
multi_nonblanks_found = 1
x += 1
x -= 1
else:
# empty line
num_blanks += 1
x += 1
while x < len(ref_sect) and ref_sect[x].isspace():
x += 1
if x == len(ref_sect):
# Blanks at end of doc: don't count
num_blanks -= 1
x -= 1
x += 1
# Now from the number of blank lines & the number of text lines, if
# num_lines > 3, & num_blanks = num_lines, or num_blanks = num_lines - 1,
# then we have blank line separators between reference lines
- if (num_lines > 3) and ((num_blanks == num_lines) or \
+ if (num_lines > 3) and ((num_blanks == num_lines) or
(num_blanks == num_lines - 1)) and \
(multi_nonblanks_found):
blank_line_separators = 1
return blank_line_separators
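# A hedged sketch: the separators must be whitespace lines (u''.isspace()
# is False) and at least one reference must wrap onto a second line so
# that multi_nonblanks_found is set:
# >>> test_for_blank_lines_separating_reference_lines(
# ... [u"Ref1a", u"Ref1b", u" ", u"Ref2", u" ", u"Ref3", u" ", u"Ref4", u" "])
# 1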
diff --git a/modules/docextract/lib/refextract_unit_tests.py b/modules/docextract/lib/refextract_unit_tests.py
index 1380b09f3..c9dc6ed3c 100644
--- a/modules/docextract/lib/refextract_unit_tests.py
+++ b/modules/docextract/lib/refextract_unit_tests.py
@@ -1,233 +1,309 @@
# -*- coding: utf-8 -*-
##
## This file is part of Invenio.
## Copyright (C) 2010, 2011, 2013 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
"""
The Refextract unit test suite
The tests will not modifiy the database.
"""
from invenio.testutils import InvenioTestCase
import re
from invenio.testutils import make_test_suite, run_test_suite
# Import the minimal necessary methods and variables needed to run Refextract
from invenio.docextract_utils import setup_loggers
-from invenio.refextract_tag import identify_ibids, tag_numeration
+from invenio.refextract_tag import identify_ibids, \
+ find_numeration, \
+ find_numeration_more
from invenio import refextract_re
from invenio.refextract_find import get_reference_section_beginning
from invenio.refextract_api import search_from_reference
+from invenio.refextract_text import rebuild_reference_lines
class ReTest(InvenioTestCase):
def setUp(self):
setup_loggers(verbosity=1)
def test_word(self):
r = refextract_re._create_regex_pattern_add_optional_spaces_to_word_characters('ABC')
self.assertEqual(r, ur'A\s*B\s*C\s*')
def test_reference_section_title_pattern(self):
r = refextract_re.get_reference_section_title_patterns()
self.assert_(len(r) > 2)
def test_get_reference_line_numeration_marker_patterns(self):
r = refextract_re.get_reference_line_numeration_marker_patterns()
self.assert_(len(r) > 2)
def test_get_reference_line_marker_pattern(self):
r = refextract_re.get_reference_line_marker_pattern('ABC')
self.assertNotEqual(r.pattern.find('ABC'), -1)
def test_get_post_reference_section_title_patterns(self):
r = refextract_re.get_post_reference_section_title_patterns()
self.assert_(len(r) > 2)
def test_get_post_reference_section_keyword_patterns(self):
r = refextract_re.get_post_reference_section_keyword_patterns()
self.assert_(len(r) > 2)
def test_regex_match_list(self):
s = 'ABC'
m = refextract_re.regex_match_list(s, [
re.compile('C.C'),
re.compile('A.C')
])
self.assert_(m)
m = refextract_re.regex_match_list(s, [
re.compile('C.C')
])
self.assertEqual(m, None)
class IbidTest(InvenioTestCase):
"""Testing output of refextract"""
def setUp(self):
setup_loggers(verbosity=1)
def test_identify_ibids_empty(self):
r = identify_ibids("")
- self.assertEqual(r, ({}, {}, ''))
+ self.assertEqual(r, ({}, ''))
def test_identify_ibids_simple(self):
ref_line = u"""[46] E. Schrodinger, Sitzungsber. Preuss. Akad. Wiss. Phys. Math. Kl. 24, 418(1930); ibid, 3, 1(1931)"""
r = identify_ibids(ref_line.upper())
- self.assertEqual(r, ({85: 4}, {85: u'IBID'}, u'[46] E. SCHRODINGER, SITZUNGSBER. PREUSS. AKAD. WISS. PHYS. MATH. KL. 24, 418(1930); ____, 3, 1(1931)'))
+ self.assertEqual(r, ({85: u'IBID'}, u'[46] E. SCHRODINGER, SITZUNGSBER. PREUSS. AKAD. WISS. PHYS. MATH. KL. 24, 418(1930); ____, 3, 1(1931)'))
-class TagNumerationTest(InvenioTestCase):
+class FindNumerationTest(InvenioTestCase):
def setUp(self):
setup_loggers(verbosity=1)
def test_vol_page_year(self):
"<vol>, <page> (<year>)"
ref_line = u"""24, 418 (1930)"""
- r = tag_numeration(ref_line)
- self.assertEqual(r.strip(': '), u"<cds.VOL>24</cds.VOL> <cds.YR>(1930)</cds.YR> <cds.PG>418</cds.PG>")
+ r = find_numeration(ref_line)
+ self.assertEqual(r['volume'], u"24")
+ self.assertEqual(r['year'], u"1930")
+ self.assertEqual(r['page'], u"418")
def test_vol_year_page(self):
"<vol>, (<year>) <page> "
ref_line = u"""24, (1930) 418"""
- r = tag_numeration(ref_line)
- self.assertEqual(r.strip(': '), u"<cds.VOL>24</cds.VOL> <cds.YR>(1930)</cds.YR> <cds.PG>418</cds.PG>")
+ r = find_numeration(ref_line)
+ self.assertEqual(r['volume'], u"24")
+ self.assertEqual(r['year'], u"1930")
+ self.assertEqual(r['page'], u"418")
+
+ def test_year_title_volume_page(self):
+ "<year>, <title> <vol> <page> "
+ ref_line = u"""1930 <cds.JOURNAL>J.Phys.</cds.JOURNAL> 24, 418"""
+ r = find_numeration_more(ref_line)
+ self.assertEqual(r['volume'], u"24")
+ self.assertEqual(r['year'], u"1930")
+ self.assertEqual(r['page'], u"418")
class FindSectionTest(InvenioTestCase):
def setUp(self):
setup_loggers(verbosity=1)
def test_simple(self):
sect = get_reference_section_beginning([
"Hello",
"References",
"[1] Ref1"
])
self.assertEqual(sect, {
'marker': '[1]',
- 'marker_pattern': u'^\\s*(?P<mark>\\[\\s*(?P<marknum>\\d+)\\s*\\])',
+ 'marker_pattern': u'\\s*(?P<mark>\\[\\s*(?P<marknum>\\d+)\\s*\\])',
'start_line': 1,
'title_string': 'References',
'title_marker_same_line': False,
'how_found_start': 1,
})
def test_no_section(self):
sect = get_reference_section_beginning("")
self.assertEqual(sect, None)
def test_no_title_via_brackets(self):
sect = get_reference_section_beginning([
"Hello",
"[1] Ref1"
"[2] Ref2"
])
self.assertEqual(sect, {
'marker': '[1]',
'marker_pattern': u'(?P<mark>(?P<left>\\[)\\s*(?P<marknum>\\d+)\\s*(?P<right>\\]))',
'start_line': 1,
'title_string': None,
'title_marker_same_line': False,
'how_found_start': 2,
})
def test_no_title_via_dots(self):
sect = get_reference_section_beginning([
"Hello",
"1. Ref1"
"2. Ref2"
])
self.assertEqual(sect, {
'marker': '1.',
'marker_pattern': u'(?P<mark>(?P<left>)\\s*(?P<marknum>\\d+)\\s*(?P<right>\\.))',
'start_line': 1,
'title_string': None,
'title_marker_same_line': False,
'how_found_start': 3,
})
def test_no_title_via_numbers(self):
sect = get_reference_section_beginning([
"Hello",
"1 Ref1"
"2 Ref2"
])
self.assertEqual(sect, {
'marker': '1',
'marker_pattern': u'(?P<mark>(?P<left>)\\s*(?P<marknum>\\d+)\\s*(?P<right>))',
'start_line': 1,
'title_string': None,
'title_marker_same_line': False,
'how_found_start': 4,
})
def test_no_title_via_numbers2(self):
sect = get_reference_section_beginning([
"Hello",
"1",
"Ref1",
"(3)",
"2",
"Ref2",
])
self.assertEqual(sect, {
'marker': '1',
'marker_pattern': u'(?P<mark>(?P<left>)\\s*(?P<marknum>\\d+)\\s*(?P<right>))',
'start_line': 1,
'title_string': None,
'title_marker_same_line': False,
'how_found_start': 4,
})
class SearchTest(InvenioTestCase):
def setUp(self):
setup_loggers(verbosity=9)
from invenio import refextract_kbs
self.old_override = refextract_kbs.CFG_REFEXTRACT_KBS_OVERRIDE
refextract_kbs.CFG_REFEXTRACT_KBS_OVERRIDE = {}
def tearDown(self):
from invenio import refextract_kbs
refextract_kbs.CFG_REFEXTRACT_KBS_OVERRIDE = self.old_override
def test_not_recognized(self):
field, pattern = search_from_reference('[1] J. Mars, oh hello')
self.assertEqual(field, '')
self.assertEqual(pattern, '')
def test_report(self):
field, pattern = search_from_reference('[1] J. Mars, oh hello, [hep-ph/0104088]')
self.assertEqual(field, 'report')
self.assertEqual(pattern, 'hep-ph/0104088')
def test_journal(self):
field, pattern = search_from_reference('[1] J. Mars, oh hello, Nucl.Phys. B76 (1974) 477-482')
self.assertEqual(field, 'journal')
self.assert_('Nucl' in pattern)
self.assert_('B76' in pattern)
self.assert_('477' in pattern)
+
+class RebuildReferencesTest(unittest.TestCase):
+ def setUp(self):
+ setup_loggers(verbosity=9)
+
+ def test_simple(self):
+ marker_pattern = ur"^\s*(?P<mark>\[\s*(?P<marknum>\d+)\s*\])"
+ refs = [
+ u"[1] hello",
+ u"hello2",
+ u"[2] foo",
+ ]
+ rebuilt_refs = rebuild_reference_lines(refs, marker_pattern)
+ self.assertEqual(rebuilt_refs, [
+ u"[1] hello hello2",
+ u"[2] foo",
+ ])
+
+ # def test_pagination_removal(self):
+ # marker_pattern = ur"^\s*(?P<mark>\[\s*(?P<marknum>\d+)\s*\])"
+ # refs = [
+ # u"[1] hello",
+ # u"hello2",
+ # u"[42]",
+ # u"[2] foo",
+ # ]
+ # rebuilt_refs = rebuild_reference_lines(refs, marker_pattern)
+ # self.assertEqual(rebuilt_refs, [
+ # u"[1] hello hello2",
+ # u"[2] foo",
+ # ])
+
+ def test_pagination_non_removal(self):
+ marker_pattern = ur"^\s*(?P<mark>\[\s*(?P<marknum>\d+)\s*\])"
+ refs = [
+ u"[1] hello",
+ u"hello2",
+ u"[2]",
+ u"foo",
+ ]
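+ # A bare "[2]" marker followed by text on the next line is kept and
+ # merged into a single reference, rather than dropped as pagination
+ # noise (contrast with the disabled test_pagination_removal above).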
+ rebuilt_refs = rebuild_reference_lines(refs, marker_pattern)
+ self.assertEqual(rebuilt_refs, [
+ u"[1] hello hello2",
+ u"[2] foo",
+ ])
+
+ def test_2_lines_together(self):
+ marker_pattern = ur"\s*(?P<mark>\[\s*(?P<marknum>\d+)\s*\])"
+ refs = [
+ u"[1] hello",
+ u"hello2 [2] foo",
+ ]
+ rebuilt_refs = rebuild_reference_lines(refs, marker_pattern)
+ self.assertEqual(rebuilt_refs, [
+ u"[1] hello hello2",
+ u"[2] foo",
+ ])
+
+
TEST_SUITE = make_test_suite(ReTest,
IbidTest,
- TagNumerationTest,
+ FindNumerationTest,
FindSectionTest,
- SearchTest)
+ SearchTest,
+ RebuildReferencesTest)
if __name__ == '__main__':
run_test_suite(TEST_SUITE)
diff --git a/modules/docextract/lib/refextract_xml.py b/modules/docextract/lib/refextract_xml.py
deleted file mode 100644
index 4f460ddee..000000000
--- a/modules/docextract/lib/refextract_xml.py
+++ /dev/null
@@ -1,713 +0,0 @@
-# -*- coding: utf-8 -*-
-##
-## This file is part of Invenio.
-## Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010, 2011 CERN.
-##
-## Invenio is free software; you can redistribute it and/or
-## modify it under the terms of the GNU General Public License as
-## published by the Free Software Foundation; either version 2 of the
-## License, or (at your option) any later version.
-##
-## Invenio is distributed in the hope that it will be useful, but
-## WITHOUT ANY WARRANTY; without even the implied warranty of
-## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
-## General Public License for more details.
-##
-## You should have received a copy of the GNU General Public License
-## along with Invenio; if not, write to the Free Software Foundation, Inc.,
-## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
-
-import re
-
-from xml.sax.saxutils import escape as encode_for_xml
-from datetime import datetime
-
-from invenio.refextract_re import re_num
-from invenio.docextract_utils import write_message
-from invenio.refextract_config import \
- CFG_REFEXTRACT_TAG_ID_REFERENCE, \
- CFG_REFEXTRACT_IND1_REFERENCE, \
- CFG_REFEXTRACT_IND2_REFERENCE, \
- CFG_REFEXTRACT_SUBFIELD_MARKER, \
- CFG_REFEXTRACT_SUBFIELD_AUTH, \
- CFG_REFEXTRACT_SUBFIELD_TITLE, \
- CFG_REFEXTRACT_SUBFIELD_MISC, \
- CGF_REFEXTRACT_SEMI_COLON_MISC_TEXT_SENSITIVITY, \
- CFG_REFEXTRACT_SUBFIELD_REPORT_NUM, \
- CFG_REFEXTRACT_XML_RECORD_OPEN, \
- CFG_REFEXTRACT_CTRL_FIELD_RECID, \
- CFG_REFEXTRACT_TAG_ID_EXTRACTION_STATS, \
- CFG_REFEXTRACT_IND1_EXTRACTION_STATS, \
- CFG_REFEXTRACT_IND2_EXTRACTION_STATS, \
- CFG_REFEXTRACT_SUBFIELD_EXTRACTION_STATS, \
- CFG_REFEXTRACT_SUBFIELD_EXTRACTION_TIME, \
- CFG_REFEXTRACT_SUBFIELD_EXTRACTION_VERSION, \
- CFG_REFEXTRACT_VERSION, \
- CFG_REFEXTRACT_XML_RECORD_CLOSE, \
- CFG_REFEXTRACT_SUBFIELD_URL_DESCR, \
- CFG_REFEXTRACT_SUBFIELD_URL, \
- CFG_REFEXTRACT_SUBFIELD_DOI, \
- CGF_REFEXTRACT_ADJACENT_AUTH_MISC_SEPARATION, \
- CFG_REFEXTRACT_SUBFIELD_QUOTED, \
- CFG_REFEXTRACT_SUBFIELD_ISBN, \
- CFG_REFEXTRACT_SUBFIELD_PUBLISHER, \
- CFG_REFEXTRACT_SUBFIELD_YEAR, \
- CFG_REFEXTRACT_SUBFIELD_BOOK
-
-from invenio import config
-CFG_INSPIRE_SITE = getattr(config, 'CFG_INSPIRE_SITE', False)
-
-
-def format_marker(line_marker):
- if line_marker:
- num_match = re_num.search(line_marker)
- if num_match:
- line_marker = num_match.group(0)
- return line_marker
-
-
-def create_xml_record(counts, recid, xml_lines, status_code=0):
- """Given a series of MARC XML-ized reference lines and a record-id, write a
- MARC XML record to the stdout stream. Include in the record some stats
- for the extraction job.
- The printed MARC XML record will essentially take the following
- structure:
- <record>
- <controlfield tag="001">1</controlfield>
- <datafield tag="999" ind1="C" ind2="5">
- [...]
- </datafield>
- [...]
- <datafield tag="999" ind1="C" ind2="6">
- <subfield code="a">
- Invenio/X.XX.X refextract/X.XX.X-timestamp-err-repnum-title-URL-misc
- </subfield>
- </datafield>
- </record>
- Timestamp, error(code), reportnum, title, URL, and misc will of
- course take the relevant values.
-
- @param status_code: (integer) the status of reference-extraction for the
- given record: was there an error or not? 0 = no error; 1 = error.
- @param count_reportnum: (integer) - the number of institutional
- report-number citations found in the document's reference lines.
- @param count_title: (integer) - the number of journal title citations
- found in the document's reference lines.
- @param count_url: (integer) - the number of URL citations found in the
- document's reference lines.
- @param count_misc: (integer) - the number of sections of miscellaneous
- text (i.e. 999C5$m) from the document's reference lines.
- @param count_auth_group: (integer) - the total number of author groups
- identified ($h)
- @param recid: (string) - the record-id of the given document. (put into
- 001 field.)
- @param xml_lines: (list) of strings. Each string in the list contains a
- group of MARC XML 999C5 datafields, making up a single reference line.
- These reference lines will make up the document body.
- @return: The entire MARC XML textual output, plus recognition statistics.
- """
- out = []
-
- ## Start with the opening record tag:
- out += u"%(record-open)s\n" \
- % {'record-open': CFG_REFEXTRACT_XML_RECORD_OPEN, }
-
- ## Display the record-id controlfield:
- out += \
- u""" <controlfield tag="%(cf-tag-recid)s">%(recid)d</controlfield>\n""" \
- % {'cf-tag-recid' : CFG_REFEXTRACT_CTRL_FIELD_RECID,
- 'recid' : recid,
- }
-
- ## Loop through all xml lines and add them to the output string:
- out.extend(xml_lines)
-
- ## add the 999C6 status subfields:
- out += u""" <datafield tag="%(df-tag-ref-stats)s" ind1="%(df-ind1-ref-stats)s" ind2="%(df-ind2-ref-stats)s">
- <subfield code="%(sf-code-ref-stats)s">%(status)s-%(reportnum)s-%(title)s-%(author)s-%(url)s-%(doi)s-%(misc)s</subfield>
- <subfield code="%(sf-code-ref-time)s">%(timestamp)s</subfield>
- <subfield code="%(sf-code-ref-version)s">%(version)s</subfield>
- </datafield>\n""" \
- % {'df-tag-ref-stats' : CFG_REFEXTRACT_TAG_ID_EXTRACTION_STATS,
- 'df-ind1-ref-stats' : CFG_REFEXTRACT_IND1_EXTRACTION_STATS,
- 'df-ind2-ref-stats' : CFG_REFEXTRACT_IND2_EXTRACTION_STATS,
- 'sf-code-ref-stats' : CFG_REFEXTRACT_SUBFIELD_EXTRACTION_STATS,
- 'sf-code-ref-time' : CFG_REFEXTRACT_SUBFIELD_EXTRACTION_TIME,
- 'sf-code-ref-version': CFG_REFEXTRACT_SUBFIELD_EXTRACTION_VERSION,
- 'version' : CFG_REFEXTRACT_VERSION,
- 'timestamp' : datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
- 'status' : status_code,
- 'reportnum' : counts['reportnum'],
- 'title' : counts['title'],
- 'author' : counts['auth_group'],
- 'url' : counts['url'],
- 'doi' : counts['doi'],
- 'misc' : counts['misc'],
- }
-
- ## Now add the closing tag to the record:
- out += u"%(record-close)s\n" \
- % {'record-close' : CFG_REFEXTRACT_XML_RECORD_CLOSE, }
-
- ## Be sure to call this BEFORE compress_subfields
- out = filter_processed_references(''.join(out))
- ## Compress multiple 'm' subfields in a datafield
- out = compress_subfields(out, CFG_REFEXTRACT_SUBFIELD_MISC)
- ## Compress multiple 'h' subfields in a datafield
- out = compress_subfields(out, CFG_REFEXTRACT_SUBFIELD_AUTH)
- return out
-
-
-def build_xml_citations(splitted_citations, line_marker):
- return [build_xml_citation(citation_elements, line_marker) \
- for citation_elements in splitted_citations]
-
-
-def build_xml_citation(citation_elements, line_marker, inspire_format=None):
- """ Create the MARC-XML string of the found reference information which was taken
- from a tagged reference line.
- @param citation_elements: (list) an ordered list of dictionary elements,
- with each element corresponding to a found piece of information from a reference line.
- @param line_marker: (string) The line marker for this single reference line (e.g. [19])
- @return xml_line: (string) The MARC-XML representation of the list of reference elements
- """
- if inspire_format is None:
- inspire_format = CFG_INSPIRE_SITE
-
- ## Begin the datafield element
- xml_line = start_datafield_element(line_marker)
-
- ## This will hold the ordering of tags which have been appended to the xml line
- ## This list will be used to control the decisions involving the creation of new citation lines
- ## (in the event of a new set of authors being recognised, or strange title ordering...)
- line_elements = []
-
- ## This is a list which will hold the current 'overview' of a single reference line,
- ## as a list of lists, where each list corresponds to the contents of a datafield element
- ## in the xml mark-up
- citation_structure = []
- auth_for_ibid = None
-
- for element in citation_elements:
- ## Before going onto checking 'what' the next element is, handle misc text and semi-colons
- ## Multiple misc text subfields will be compressed later
- ## This will also be the only part of the code that deals with MISC tag_typed elements
- if element['misc_txt'].strip(".,:;- []"):
- xml_line = append_subfield_element(xml_line,
- CFG_REFEXTRACT_SUBFIELD_MISC,
- element['misc_txt'].strip(".,:;- []"))
-
- # Now handle the type dependent actions
- # TITLE
- if element['type'] == "JOURNAL":
-
- # Select the journal title output format
- if inspire_format:
- # ADD to current datafield
- xml_line += """
- <subfield code="%(sf-code-ref-title)s">%(title)s,%(volume)s,%(page)s</subfield>""" \
- % {'sf-code-ref-title': CFG_REFEXTRACT_SUBFIELD_TITLE,
- 'title' : encode_for_xml(element['title']),
- 'volume' : encode_for_xml(element['volume']),
- 'page' : encode_for_xml(element['page']),
- }
- else:
- # ADD to current datafield
- xml_line += """
- <subfield code="%(sf-code-ref-title)s">%(title)s %(volume)s (%(year)s) %(page)s</subfield>""" \
- % {'sf-code-ref-title': CFG_REFEXTRACT_SUBFIELD_TITLE,
- 'title' : encode_for_xml(element['title']),
- 'volume' : encode_for_xml(element['volume']),
- 'year' : encode_for_xml(element['year']),
- 'page' : encode_for_xml(element['page']),
- }
-
- # Now, if there are any extra (numeration based) IBID's after this title
- if len(element['extra_ibids']) > 0:
- # At least one IBID is present, these are to be outputted each into their own datafield
- for ibid in element['extra_ibids']:
- # %%%%% Set as NEW citation line %%%%%
- (xml_line, auth_for_ibid) = append_datafield_element(line_marker,
- citation_structure,
- line_elements,
- auth_for_ibid,
- xml_line)
- if inspire_format:
- xml_line += """
- <subfield code="%(sf-code-ref-title)s">%(title)s,%(volume)s,%(page)s</subfield>""" \
- % {'sf-code-ref-title': CFG_REFEXTRACT_SUBFIELD_TITLE,
- 'title' : encode_for_xml(ibid['title']),
- 'volume' : encode_for_xml(ibid['volume']),
- 'page' : encode_for_xml(ibid['page']),
- }
- else:
- xml_line += """
- <subfield code="%(sf-code-ref-title)s">%(title)s %(volume)s (%(year)s) %(page)s</subfield>""" \
- % {'sf-code-ref-title': CFG_REFEXTRACT_SUBFIELD_TITLE,
- 'title' : encode_for_xml(ibid['title']),
- 'volume' : encode_for_xml(ibid['volume']),
- 'year' : encode_for_xml(ibid['year']),
- 'page' : encode_for_xml(ibid['page']),
- }
- # Add a Title element to the past elements list, since we last found an IBID
- line_elements.append(element)
-
- # REPORT NUMBER
- elif element['type'] == "REPORTNUMBER":
- # ADD to current datafield
- xml_line = append_subfield_element(xml_line,
- CFG_REFEXTRACT_SUBFIELD_REPORT_NUM,
- element['report_num'])
- line_elements.append(element)
-
- # URL
- elif element['type'] == "URL":
- if element['url_string'] == element['url_desc']:
- # Build the datafield for the URL segment of the reference line:
- xml_line = append_subfield_element(xml_line,
- CFG_REFEXTRACT_SUBFIELD_URL,
- element['url_string'])
- # Else, in the case that the url string and the description differ in some way, include them both
- else:
- # Build the datafield for the URL segment of the reference line:
- xml_line += """
- <subfield code="%(sf-code-ref-url)s">%(url)s</subfield>
- <subfield code="%(sf-code-ref-url-desc)s">%(url-desc)s</subfield>""" \
- % {'sf-code-ref-url' : CFG_REFEXTRACT_SUBFIELD_URL,
- 'sf-code-ref-url-desc': CFG_REFEXTRACT_SUBFIELD_URL_DESCR,
- 'url' : encode_for_xml(element['url_string']),
- 'url-desc' : encode_for_xml(element['url_desc'])
- }
- line_elements.append(element)
-
- # DOI
- elif element['type'] == "DOI":
- ## Split on hitting another DOI in the same line
- if is_in_line_elements("DOI", line_elements):
- ## %%%%% Set as NEW citation line %%%%%
- xml_line, auth_for_ibid = append_datafield_element(line_marker,
- citation_structure,
- line_elements,
- auth_for_ibid,
- xml_line)
- xml_line = append_subfield_element(xml_line,
- CFG_REFEXTRACT_SUBFIELD_DOI,
- element['doi_string'])
- line_elements.append(element)
-
- # AUTHOR
- elif element['type'] == "AUTH":
- value = element['auth_txt']
- if element['auth_type'] == 'incl':
- value = "(%s)" % value
-
- if is_in_line_elements("AUTH", line_elements) and line_elements[-1]['type'] != "AUTH":
- xml_line = append_subfield_element(xml_line,
- CFG_REFEXTRACT_SUBFIELD_MISC,
- value)
- else:
- xml_line = append_subfield_element(xml_line,
- CFG_REFEXTRACT_SUBFIELD_AUTH,
- value)
- line_elements.append(element)
-
- elif element['type'] == "QUOTED":
- xml_line = append_subfield_element(xml_line,
- CFG_REFEXTRACT_SUBFIELD_QUOTED,
- element['title'])
- line_elements.append(element)
-
- elif element['type'] == "ISBN":
- xml_line = append_subfield_element(xml_line,
- CFG_REFEXTRACT_SUBFIELD_ISBN,
- element['ISBN'])
- line_elements.append(element)
-
- elif element['type'] == "BOOK":
- xml_line = append_subfield_element(xml_line,
- CFG_REFEXTRACT_SUBFIELD_QUOTED,
- element['title'])
- xml_line += '\n <subfield code="%s" />' % \
- CFG_REFEXTRACT_SUBFIELD_BOOK
- line_elements.append(element)
-
- elif element['type'] == "PUBLISHER":
- xml_line = append_subfield_element(xml_line,
- CFG_REFEXTRACT_SUBFIELD_PUBLISHER,
- element['publisher'])
- line_elements.append(element)
-
- elif element['type'] == "YEAR":
- xml_line = append_subfield_element(xml_line,
- CFG_REFEXTRACT_SUBFIELD_YEAR,
- element['year'])
- line_elements.append(element)
-
- # Append the author, if needed for an ibid, for the last element
- # in the entire line. Don't bother setting the author to be used
- # for ibids, since the line is finished
- xml_line += check_author_for_ibid(line_elements, auth_for_ibid)[0]
-
- # Close the ending datafield element
- xml_line += "\n </datafield>\n"
-
- return xml_line
-
-
-def append_subfield_element(xml_line, subfield_code, value):
- xml_element = '\n <subfield code="' \
- '%(sf-code-ref-auth)s">%(value)s</subfield>' % {
- 'value' : encode_for_xml(value),
- 'sf-code-ref-auth' : subfield_code,
- }
- return xml_line + xml_element
-
-
-def start_datafield_element(line_marker):
- """ Start a brand new datafield element with a marker subfield.
- @param line_marker: (string) The line marker which will be the sole
- content of the newly created marker subfield. This will always be the
- first subfield to be created for a new datafield element.
- @return: (string) The string holding the relevant datafield and
- subfield tags.
- """
- marker_subfield = """
- <subfield code="%(sf-code-ref-marker)s">%(marker-val)s</subfield>""" \
- % {'sf-code-ref-marker': CFG_REFEXTRACT_SUBFIELD_MARKER,
- 'marker-val' : encode_for_xml(format_marker(line_marker))}
-
- new_datafield = """ <datafield tag="%(df-tag-ref)s" ind1="%(df-ind1-ref)s" ind2="%(df-ind2-ref)s">%(marker-subfield)s""" \
- % {'df-tag-ref' : CFG_REFEXTRACT_TAG_ID_REFERENCE,
- 'df-ind1-ref' : CFG_REFEXTRACT_IND1_REFERENCE,
- 'df-ind2-ref' : CFG_REFEXTRACT_IND2_REFERENCE,
- 'marker-subfield': marker_subfield}
-
- return new_datafield
-
-
-def dump_or_split_author(misc_txt, line_elements):
- """
- Given the list of current elements, and misc text, try to decide how to use this
- author for splitting heuristics, and see if it is useful. Returning 'dump' indicates
- that this author should be put into misc text, since it has been identified as bad.
- 'split' indicates that the line should be split and this author placed into the fresh
- datafield. The empty string indicates that this author should be added as normal to
- the current xml datafield.
-
- A line will be split using author information in two situations:
- 1. When there already exists a previous author group in the same line
- 2. If the only item in the current line is a title, with no misc text
- In both situations, the newly found author element is placed into the newly created
- datafield.
-
- This method heavily assumes that the first author group found in a single citation is the
- most reliable (In accordance with the IEEE standard, which states that authors should
- be written at the beginning of a citation, in the overwhelming majority of cases).
- @param misc_txt: (string) The misc text for this reference line
- @param line_elements: (list) The list of elements found for this current line
- @return: (string) The action to take to deal with this author.
- """
- ## If an author has already been found in this reference line
- if is_in_line_elements("AUTH", line_elements):
-
- ## If this author group is directly after another author group,
- ## with minimal misc text between, then this author group is very likely to be wrong.
- if line_elements[-1]['type'] == "AUTH" \
- and len(misc_txt) < CGF_REFEXTRACT_ADJACENT_AUTH_MISC_SEPARATION:
- return "dump"
- ## Else, trigger a new reference line
- return "split"
-
- ## In cases where an author is directly after an alone title (ibid or normal, with no misc),
- ## Trigger a new reference line
- if is_in_line_elements("JOURNAL", line_elements) and len(line_elements) == 1 \
- and len(misc_txt) == 0:
- return "split"
-
- return ""
-
-
-def is_in_line_elements(element_type, line_elements):
- """ Checks the list of current elements in the line for the given element type """
- for i, element in enumerate(line_elements):
- if element['type'] == element_type:
- return (True, line_elements[i])
- return False
-
-
-def split_on_semi_colon(misc_txt, line_elements, elements_processed, total_elements):
- """ Given some misc text, see if there are any semi-colons which may indiciate that
- a reference line is in fact two separate citations.
- @param misc_txt: (string) The misc_txt to look for semi-colons within.
- @param line_elements: (list) The list of single upper-case chars which
- represent an element of a reference which has been processed.
- @param elements_processed: (integer) The number of elements which have been
- *looked at* for this entire reference line, regardless of splits
- @param total_elements: (integer) The total number of elements which
- have been identified in the *entire* reference line
- @return: (string) Depicting where the semi-colon was found in relation to the
- rest of the misc_txt. False if a semi-colon was not found, or one was found
- relating to an escaped piece of text.
- """
- ## If there has already been meaningful information found in the reference
- ## and there are still elements to be processed beyond the element relating to
- ## this misc_txt
- if (is_in_line_elements("JOURNAL", line_elements) \
- or is_in_line_elements("REPORTNUMBER", line_elements) \
- or len(misc_txt) >= CGF_REFEXTRACT_SEMI_COLON_MISC_TEXT_SENSITIVITY) \
- and elements_processed < total_elements:
-
- if len(misc_txt) >= 4 and \
- (misc_txt[-5:] == '&amp;' or misc_txt[-4:] == '&lt;'):
- ## This is a semi-colon which does not indicate a new citation
- return ""
- else:
- ## If a semi-colon is at the end, make sure to append preceding misc_txt to
- ## the current datafield element
- if misc_txt.strip(" .,")[-1] == ";":
- return "after"
- ## Else, make sure to append the misc_txt to the *newly created datafield element*
- elif misc_txt.strip(" .,")[0] == ";":
- return "before"
- return ""
-
-
-def check_author_for_ibid(line_elements, author):
- """ Given a list of elements for an *entire* reference line, and the current
- author element to be used for ibids, check to see if that author element needs
- to be inserted into this line, depending on the presence of ibids and whether
- or not there is already an author paired with an ibid.
- Also, if no ibids are present in the line, see if the author element needs
- to be updated, depending on the presence of a normal title and a corresponding
- author group.
- @param line_elements: List of line elements for the entire processed reference
- line
- @param author: The current parent author element to be used with an ibid
- @return: (tuple) - containing a possible new author subfield, and the parent
- author element to be used for future ibids (if any)
- """
- ## Upon splitting, check for ibids in the previous line,
- ## If an appropriate author was found, pair it with this ibid.
- ## (i.e., an author has not been explicitly paired with this ibid already
- ## and an author exists with the parent title to which this ibid refers)
- if is_in_line_elements("JOURNAL", line_elements):
- ## Get the title element for this line
- title_element = is_in_line_elements("JOURNAL", line_elements)[1]
-
- if author != None and not is_in_line_elements("AUTH", line_elements) \
- and title_element['is_ibid']:
- ## Return the author subfield which needs to be appended for an ibid in the line
- ## No need to reset the author to be used for ibids, since this line holds an ibid
- return """
- <subfield code="%(sf-code-ref-auth)s">%(authors)s</subfield>""" \
- % {'authors' : encode_for_xml(author['auth_txt'].strip('()')),
- 'sf-code-ref-auth' : CFG_REFEXTRACT_SUBFIELD_AUTH,
- }, author
-
- ## Set the author to be used for ibids, when a standard title is present in this line,
- ## as well as an author
- if not title_element['is_ibid'] and is_in_line_elements("AUTH", line_elements):
- ## Set the author to be used for ibids, in the event that a subsequent ibid is found
- ## this author element will be repeated.
- ## This author is only used when an ibid is in a line
- ## and there is no other author found in the line.
- author = is_in_line_elements("AUTH", line_elements)[1]
- ## If there is no author associated with this head title, clear the author to be used for ibids
- elif not title_element['is_ibid']:
- author = None
-
- ## If an author does not need to be replicated for an ibid, append nothing to the xml line
- return "", author
-
-
-def append_datafield_element(line_marker,
- citation_structure,
- line_elements,
- author,
- xml_line):
- """ Finish the current datafield element and start a new one, with a new
- marker subfield.
- @param line_marker: (string) The line marker which will be the sole
- content of the newly created marker subfield. This will always be the
- first subfield to be created for a new datafield element.
- @return new_datafield: (string) The string holding the relevant
- datafield and subfield tags.
- """
- ## Add an author, if one must be added for ibid's, before splitting this line
- ## Also, if a standard title and an author are both present, save the author for future use
- new_datafield, author = check_author_for_ibid(line_elements, author)
-
- xml_line += new_datafield
- ## Start the new datafield
- xml_line += """
- </datafield>
- <datafield tag="%(df-tag-ref)s" ind1="%(df-ind1-ref)s" ind2="%(df-ind2-ref)s">
- <subfield code="%(sf-code-ref-marker)s">%(marker-val)s</subfield>""" \
- % {'df-tag-ref' : CFG_REFEXTRACT_TAG_ID_REFERENCE,
- 'df-ind1-ref' : CFG_REFEXTRACT_IND1_REFERENCE,
- 'df-ind2-ref' : CFG_REFEXTRACT_IND2_REFERENCE,
- 'sf-code-ref-marker' : CFG_REFEXTRACT_SUBFIELD_MARKER,
- 'marker-val' : encode_for_xml(format_marker(line_marker))
- }
-
- ## add the past elements of the previous citation to the citation_structure list
- ## (citation_structure is a reference to the initial citation_structure list found in the calling method)
- citation_structure.append(line_elements)
-
- ## Clear the elements in the referenced list of elements
- del line_elements[:]
-
- return xml_line, author
-
-
-def filter_processed_references(out):
- """ apply filters to reference lines found - to remove junk"""
- reference_lines = out.split('\n')
-
- # Removes too long and too short m tags
- m_restricted, ref_lines = restrict_m_subfields(reference_lines)
-
- if m_restricted:
- a_tag = re.compile('\<subfield code=\"a\"\>(.*?)\<\/subfield\>')
- for i in range(len(ref_lines)):
- # Checks to see that the datafield has the attribute ind2="6",
- # Before looking to see if the subfield code attribute is 'a'
- if ref_lines[i].find('<datafield tag="999" ind1="C" ind2="6">') != -1 \
- and (len(ref_lines) - 1) > i:
- # For each line in this datafield element, try to find the subfield whose code attribute is 'a'
- while ref_lines[i].find('</datafield>') != -1 and (len(ref_lines) - 1) > i:
- i += 1
- # <subfield code="a">Invenio/X.XX.X
- # refextract/X.XX.X-timestamp-err-repnum-title-URL-misc
- # remake the "a" tag for new numbe of "m" tags
- if a_tag.search(ref_lines[i]):
- data = a_tag.search(ref_lines[i]).group(1)
- words1 = data.split()
- words2 = words1[-1].split('-')
- old_m = int(words2[-1])
- words2[-1] = str(old_m - m_restricted)
- data1 = '-'.join(words2)
- words1[-1] = data1
- new_data = ' '.join(words1)
- ref_lines[i] = ' <subfield code="a">' + new_data + '</subfield>'
- break
-
- new_out = '\n'.join([l for l in [rec.rstrip() for rec in ref_lines] if l])
-
- if len(reference_lines) != len(new_out):
- write_message(" * filter results: unfilter references line length is %d and filtered length is %d" \
- % (len(reference_lines), len(new_out)), verbose=2)
-
- return new_out
-
-
-def restrict_m_subfields(reference_lines):
- """Remove complete datafields which hold ONLY a single 'm' subfield,
- AND where the misc content is too short or too long to be of use.
- Min and max lengths derived by inspection of actual data. """
- min_length = 4
- max_length = 1024
- m_tag = re.compile('\<subfield code=\"m\"\>(.*?)\<\/subfield\>')
- filter_list = []
- m_restricted = 0
- for i in range(len(reference_lines)): # set up initial filter
- filter_list.append(1)
- for i in range(len(reference_lines)):
- if m_tag.search(reference_lines[i]):
- if (i - 2) >= 0 and (i + 1) < len(reference_lines):
- if reference_lines[i + 1].find('</datafield>') != -1 and \
- reference_lines[i - 1].find('<subfield code="o">') != -1 and \
- reference_lines[i - 2].find('<datafield') != -1:
- ## If both of these are true then it's a solitary "m" tag
- mlength = len(m_tag.search(reference_lines[i]).group(1))
- if mlength < min_length or mlength > max_length:
- filter_list[i - 2] = filter_list[i - 1] = filter_list[i] = filter_list[i + 1] = 0
- m_restricted += 1
- new_reference_lines = []
- for i in range(len(reference_lines)):
- if filter_list[i]:
- new_reference_lines.append(reference_lines[i])
- return m_restricted, new_reference_lines
-
-
-def get_subfield_content(line, subfield_code):
- """ Given a line (subfield element) and a xml code attribute for a subfield element,
- return the contents of the subfield element.
- """
- content = line.split('<subfield code="' + subfield_code + '">')[1]
- content = content.split('</subfield>')[0]
- return content
-
-
-def compress_subfields(out, subfield_code):
- """
- For each datafield, compress multiple subfields of type 'subfield_code' into a single one
- e.g. for MISC text, change xml format from:
- <datafield tag="999" ind1="C" ind2="5">
- <subfield code="o">1.</subfield>
- <subfield code="m">J. Dukelsky, S. Pittel and G. Sierra</subfield>
- <subfield code="s">Rev. Mod. Phys. 76 (2004) 643</subfield>
- <subfield code="m">and this is some more misc text</subfield>
- </datafield>
- <datafield tag="999" ind1="C" ind2="5">
- <subfield code="o">2.</subfield>
- <subfield code="m">J. von Delft and D.C. Ralph,</subfield>
- <subfield code="s">Phys. Rep. 345 (2001) 61</subfield>
- </datafield>
- to:
- <datafield tag="999" ind1="C" ind2="5">
- <subfield code="o">1.</subfield>
- <subfield code="m">J. Dukelsky, S. Pittel and G. Sierra and this is some more misc text</subfield>
- <subfield code="s">Rev. Mod. Phys. 76 (2004) 643</subfield>
- </datafield>
- <datafield tag="999" ind1="C" ind2="5">
- <subfield code="o">2.</subfield>
- <subfield code="m">J. von Delft and D.C. Ralph,</subfield>
- <subfield code="s">Phys. Rep. 345 (2001) 61</subfield>
- </datafield>
- """
- in_lines = out.split('\n')
- # hold the subfield compressed version of the xml, line by line
- new_rec_lines = []
- # Used to indicate when the selected subfield has already been reached
- # inside a particular datafield
- position = 0
- # Where the concatenated misc text is held before appended at the end
- content_text = ""
- # Components of the misc subfield elements
- subfield_start = " <subfield code=\"%s\">" % subfield_code
- subfield_end = "</subfield>"
-
- for line in in_lines:
- ## If reached the end of the datafield
- if line.find('</datafield>') != -1:
- if len(content_text) > 0:
- # Insert the concatenated misc contents back where it was first
- # encountered (don't right-strip semi-colons, as these may be
- # needed for &amp; or &lt;)
- if subfield_code == 'm':
- content_text = content_text.strip(" ,.").lstrip(" ;")
- new_rec_lines[position] = new_rec_lines[position] + \
- content_text + subfield_end
- content_text = ""
- position = 0
- new_rec_lines.append(line)
- # Found subfield in question, concatenate subfield contents
- # for this single datafield
- elif line.find(subfield_start.strip()) != -1:
- if position == 0:
- ## Save the position of this found subfield
- ## for later insertion into the same place
- new_rec_lines.append(subfield_start)
- position = len(new_rec_lines) - 1
- new_text = get_subfield_content(line, subfield_code)
- if content_text and new_text:
- ## Append spaces between merged text, if needed
- if (content_text[-1] + new_text[0]).find(" ") == -1:
- new_text = " " + new_text
- content_text += new_text
- else:
- new_rec_lines.append(line)
-
- ## Create the readable file from the list of lines.
- new_out = [l.rstrip() for l in new_rec_lines]
- return '\n'.join(filter(None, new_out))
diff --git a/modules/miscutil/lib/testutils.py b/modules/miscutil/lib/testutils.py
index fdf289605..283e11bb4 100644
--- a/modules/miscutil/lib/testutils.py
+++ b/modules/miscutil/lib/testutils.py
@@ -1,915 +1,931 @@
## This file is part of Invenio.
## Copyright (C) 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
# pylint: disable=E1102
"""
Helper functions for building and running test suites.
"""
__revision__ = "$Id$"
CFG_TESTUTILS_VERBOSE = 1
import os
import sys
import time
import unittest
import cgi
import subprocess
+import difflib
from warnings import warn
from urlparse import urlsplit, urlunsplit
from urllib import urlencode
from itertools import chain, repeat
+from xml.dom.minidom import parseString
try:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
except ImportError:
# web tests will not be available, but unit and regression tests will:
pass
from invenio.config import CFG_SITE_URL, \
CFG_SITE_SECURE_URL, CFG_LOGDIR, CFG_SITE_NAME_INTL, CFG_PYLIBDIR, \
CFG_JSTESTDRIVER_PORT, CFG_WEBDIR, CFG_PREFIX
from invenio.w3c_validator import w3c_validate, w3c_errors_to_str, \
CFG_TESTS_REQUIRE_HTML_VALIDATION
from invenio.pluginutils import PluginContainer
try:
from nose.tools import nottest
except ImportError:
def nottest(f):
"""Helper decorator to mark a function as not to be tested by nose."""
f.__test__ = False
return f
@nottest
def warn_user_about_tests(test_suite_type='regression'):
"""
Display a standard warning about running tests that might modify
user data, and wait for user confirmation, unless --yes-i-know
was specified on the command line.
"""
# Provide a command line option to avoid having to type the
# confirmation every time during development.
if '--yes-i-know' in sys.argv:
return
if test_suite_type == 'web':
sys.stderr.write("""\
**********************************************************************
** **
** A B O U T T H E W E B T E S T S U I T E **
** **
** The web test suite will be launched in Firefox. You must have **
** the Selenium IDE extension installed to be able to run the web **
** test suite. If you do, please check out the results of the web **
** test suite in the Selenium IDE window. **
** **
**********************************************************************
""")
sys.stderr.write("""\
**********************************************************************
** **
** I M P O R T A N T W A R N I N G **
** **
** The %s test suite needs to be run on a clean demo site **
** that you can obtain by doing: **
** **
** $ inveniocfg --drop-demo-site \ **
** --create-demo-site \ **
** --load-demo-records **
** **
** Note that DOING THE ABOVE WILL ERASE YOUR ENTIRE DATABASE. **
** **
** In addition, due to the write nature of some of the tests, **
** the demo DATABASE will be ALTERED WITH JUNK DATA, so that **
** it is recommended to rebuild the demo site anew afterwards. **
** **
**********************************************************************
Please confirm by typing 'Yes, I know!': """ % test_suite_type)
answer = raw_input('')
if answer != 'Yes, I know!':
sys.stderr.write("Aborted.\n")
raise SystemExit(0)
return
@nottest
def make_test_suite(*test_cases):
""" Build up a test suite given separate test cases"""
return unittest.TestSuite([unittest.makeSuite(case, 'test')
for case in test_cases])
@nottest
def run_test_suite(testsuite, warn_user=False):
"""
Convenience function to embed in test suites. Run given testsuite
and ask for user confirmation if warn_user is True.
"""
if warn_user:
warn_user_about_tests()
res = unittest.TextTestRunner(verbosity=2).run(testsuite)
return res.wasSuccessful()
def make_url(path, **kargs):
""" Helper to generate an absolute invenio URL with query
arguments"""
url = CFG_SITE_URL + path
if kargs:
url += '?' + urlencode(kargs, doseq=True)
return url
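# Quick illustration (hypothetical arguments):
#   make_url('/search', p='ellis') == CFG_SITE_URL + '/search?p=ellis'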
def make_surl(path, **kargs):
""" Helper to generate an absolute invenio Secure URL with query
arguments"""
url = CFG_SITE_SECURE_URL + path
if kargs:
url += '?' + urlencode(kargs, doseq=True)
return url
class InvenioTestUtilsBrowserException(Exception):
"""Helper exception for the regression test suite browser."""
pass
@nottest
def test_web_page_existence(url):
"""
Test whether URL exists and is well accessible.
Return True or raise exception in case of problems.
"""
import mechanize
browser = mechanize.Browser()
try:
browser.open(url)
except:
raise
return True
def get_authenticated_mechanize_browser(username="guest", password=""):
"""
Return an instance of a mechanize browser already authenticated
to Invenio
"""
try:
import mechanize
except ImportError:
raise InvenioTestUtilsBrowserException('ERROR: Cannot import mechanize.')
browser = mechanize.Browser()
browser.set_handle_robots(False) # ignore robots.txt, since we test gently
if username == "guest":
return browser
browser.open(CFG_SITE_SECURE_URL + "/youraccount/login")
browser.select_form(nr=0)
browser['p_un'] = username
browser['p_pw'] = password
browser.submit()
username_account_page_body = browser.response().read()
try:
username_account_page_body.index("You are logged in as %s." % username)
except ValueError:
raise InvenioTestUtilsBrowserException('ERROR: Cannot login as %s.' % username)
return browser
@nottest
def test_web_page_content(url,
username="guest",
password="",
expected_text="</html>",
unexpected_text="",
expected_link_target=None,
expected_link_label=None,
require_validate_p=CFG_TESTS_REQUIRE_HTML_VALIDATION):
"""Test whether web page URL as seen by user USERNAME contains
text EXPECTED_TEXT and, eventually, contains a link to
EXPECTED_LINK_TARGET (if set) labelled EXPECTED_LINK_LABEL (if
set). The EXPECTED_TEXT is checked via substring matching, the
EXPECTED_LINK_TARGET and EXPECTED_LINK_LABEL via exact string
matching.
EXPECTED_TEXT, EXPECTED_LINK_LABEL and EXPECTED_LINK_TARGET can
either be strings or list of strings (in order to check multiple
values inside same page).
Before doing the tests, login as USERNAME with password
PASSWORD. E.g. interesting values for USERNAME are "guest" or
"admin".
Return empty list in case of no problems, otherwise list of error
messages that may have been encountered during processing of
page.
"""
try:
import mechanize
except ImportError:
raise InvenioTestUtilsBrowserException('ERROR: Cannot import mechanize.')
if '--w3c-validate' in sys.argv:
require_validate_p = True
sys.stderr.write('Required validation\n')
error_messages = []
try:
browser = get_authenticated_mechanize_browser(username, password)
browser.open(url)
url_body = browser.response().read()
# now test for EXPECTED_TEXT:
# first normalize expected_text
if isinstance(expected_text, str):
expected_texts = [expected_text]
else:
expected_texts = expected_text
# then test
for cur_expected_text in expected_texts:
try:
url_body.index(cur_expected_text)
except ValueError:
raise InvenioTestUtilsBrowserException, \
'ERROR: Page %s (login %s) does not contain %s, but contains %s' % \
(url, username, cur_expected_text, url_body)
# now test for UNEXPECTED_TEXT:
# first normalize unexpected_text
if isinstance(unexpected_text, str):
if unexpected_text:
unexpected_texts = [unexpected_text]
else:
unexpected_texts = []
else:
unexpected_texts = unexpected_text
# then test
for cur_unexpected_text in unexpected_texts:
try:
url_body.index(cur_unexpected_text)
raise InvenioTestUtilsBrowserException, \
'ERROR: Page %s (login %s) contains %s.' % \
(url, username, cur_unexpected_text)
except ValueError:
pass
# now test for EXPECTED_LINK_TARGET and EXPECTED_LINK_LABEL:
if expected_link_target or expected_link_label:
# first normalize expected_link_target and expected_link_label
if isinstance(expected_link_target, str) or \
expected_link_target is None:
expected_link_targets = [expected_link_target]
else:
expected_link_targets = expected_link_target
if isinstance(expected_link_label, str) or \
expected_link_label is None:
expected_link_labels = [expected_link_label]
else:
expected_link_labels = expected_link_label
max_links = max(len(expected_link_targets), len(expected_link_labels))
expected_link_labels = chain(expected_link_labels, repeat(None))
expected_link_targets = chain(expected_link_targets, repeat(None))
# then test
for dummy in range(0, max_links):
cur_expected_link_target = expected_link_targets.next()
cur_expected_link_label = expected_link_labels.next()
try:
browser.find_link(url=cur_expected_link_target,
text=cur_expected_link_label)
except mechanize.LinkNotFoundError:
raise InvenioTestUtilsBrowserException, \
'ERROR: Page %s (login %s) does not contain link to %s entitled %s.' % \
(url, username, cur_expected_link_target, cur_expected_link_label)
# now test for validation if required
if require_validate_p:
valid_p, errors, warnings = w3c_validate(url_body)
if not valid_p:
error_text = 'ERROR: Page %s (login %s) does not validate:\n %s' % \
(url, username, w3c_errors_to_str(errors, warnings))
open('%s/w3c-markup-validator.log' % CFG_LOGDIR, 'a').write(error_text)
raise InvenioTestUtilsBrowserException, error_text
except mechanize.HTTPError, msg:
error_messages.append('ERROR: Page %s (login %s) not accessible. %s' % \
(url, username, msg))
except InvenioTestUtilsBrowserException, msg:
error_messages.append('ERROR: Page %s (login %s) led to an error: %s.' % \
(url, username, msg))
try:
# logout after tests:
browser.open(CFG_SITE_SECURE_URL + "/youraccount/logout")
browser.response().read()
browser.close()
except UnboundLocalError:
pass
if CFG_TESTUTILS_VERBOSE >= 9:
print "%s test_web_page_content(), tested page `%s', login `%s', expected text `%s', errors `%s'." % \
(time.strftime("%Y-%m-%d %H:%M:%S -->", time.localtime()),
url, username, expected_text,
",".join(error_messages))
return error_messages
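# Minimal usage sketch (hypothetical URL and expected text):
#   errors = test_web_page_content(CFG_SITE_URL + '/search?p=ellis',
#                                  expected_text='Search Results')
#   assert errors == []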
def merge_error_messages(error_messages):
"""If the ERROR_MESSAGES list is non-empty, merge them and return nicely
formatted string suitable for printing. Otherwise return empty
string.
"""
out = ""
if error_messages:
out = "\n*** " + "\n*** ".join(error_messages)
return out
@nottest
def build_and_run_unit_test_suite():
"""
Detect all Invenio modules with names ending by '*_unit_tests.py', build
a complete test suite of them, and run it.
Called by 'inveniocfg --run-unit-tests'.
"""
test_modules_map = PluginContainer(
os.path.join(CFG_PYLIBDIR, 'invenio', '*_unit_tests.py'),
lambda plugin_name, plugin_code: getattr(plugin_code, "TEST_SUITE"))
test_modules = [test_modules_map[name] for name in test_modules_map]
broken_tests = test_modules_map.get_broken_plugins()
broken_unit_tests = ['%s (reason: %s)' % (name, broken_tests[name][1]) for name in broken_tests]
if broken_unit_tests:
warn("Broken unit tests suites found: %s" % ', '.join(broken_unit_tests))
complete_suite = unittest.TestSuite(test_modules)
res = unittest.TextTestRunner(verbosity=2).run(complete_suite)
return res.wasSuccessful()
@nottest
def build_and_run_js_unit_test_suite():
"""
Init the JsTestDriver server, detect all Invenio JavaScript files with
names ending by '*_tests.js' and run them.
Called by 'inveniocfg --run-js-unit-tests'.
"""
def _server_init(server_process):
"""
Init JsTestDriver server and check if it succeeded
"""
output_success = "Finished action run"
output_error = "Server failed"
read_timeout = 30
start_time = time.time()
elapsed_time = 0
while 1:
stdout_line = server_process.stdout.readline()
if output_success in stdout_line:
print '* JsTestDriver server ready\n'
return True
elif output_error in stdout_line or elapsed_time > read_timeout:
print '* ! JsTestDriver server init failed\n'
print server_process.stdout.read()
return False
elapsed_time = time.time() - start_time
def _find_and_run_js_test_files():
"""
Find all JS files installed in Invenio lib directory and run
them on the JsTestDriver server
"""
from invenio.shellutils import run_shell_command
errors_found = 0
for candidate in os.listdir(CFG_WEBDIR + "/js"):
base, ext = os.path.splitext(candidate)
if ext != '.js' or not base.endswith('_tests'):
continue
print "Found test file %s. Running tests... " % (base + ext)
dummy_current_exitcode, cmd_stdout, dummy_err_msg = run_shell_command(cmd="java -jar %s/JsTestDriver.jar --config %s --tests all" % \
(CFG_PREFIX + "/lib/java/js-test-driver", CFG_WEBDIR + "/js/" + base + '.conf'))
print cmd_stdout
if "Fails: 0" not in cmd_stdout:
errors_found += 1
print errors_found
return errors_found
print "Going to start JsTestDriver server..."
server_process = subprocess.Popen(["java", "-jar",
"%s/JsTestDriver.jar" % (CFG_PREFIX + "/lib/java/js-test-driver"), "--runnerMode", "INFO",
"--port", "%d" % CFG_JSTESTDRIVER_PORT],
stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
try:
if not _server_init(server_process):
# There was an error initialising server
return 1
print "Now you can capture the browsers where you would " \
"like to run the tests by opening the following url:\n" \
"%s:%d/capture \n" % (CFG_SITE_URL, CFG_JSTESTDRIVER_PORT)
print "Press enter when you are ready to run tests"
raw_input()
exitcode = _find_and_run_js_test_files()
finally:
server_process.kill()
return exitcode
@nottest
def build_and_run_regression_test_suite():
"""
Detect all Invenio modules with names ending by
'*_regression_tests.py', build a complete test suite of them, and
run it. Called by 'inveniocfg --run-regression-tests'.
"""
test_modules_map = PluginContainer(
os.path.join(CFG_PYLIBDIR, 'invenio', '*_regression_tests.py'),
lambda plugin_name, plugin_code: getattr(plugin_code, "TEST_SUITE"))
test_modules = test_modules_map.values()
broken_tests = test_modules_map.get_broken_plugins()
broken_regression_tests = ['%s (reason: %s)' % (name, broken_tests[name][1]) for name in broken_tests]
if broken_regression_tests:
warn("Broken regression tests suites found: %s" % ', '.join(broken_regression_tests))
warn_user_about_tests()
complete_suite = unittest.TestSuite(test_modules)
res = unittest.TextTestRunner(verbosity=2).run(complete_suite)
return res.wasSuccessful()
@nottest
def build_and_run_web_test_suite():
"""
Detect all Invenio modules with names ending by
'*_web_tests.py', build a complete test suite of them, and
run it. Called by 'inveniocfg --run-web-tests'.
"""
test_modules_map = PluginContainer(
os.path.join(CFG_PYLIBDIR, 'invenio', '*_web_tests.py'),
lambda plugin_name, plugin_code: getattr(plugin_code, "TEST_SUITE"))
test_modules = test_modules_map.values()
broken_tests = test_modules_map.get_broken_plugins()
broken_web_tests = ['%s (reason: %s)' % (name, broken_tests[name][1]) for name in broken_tests]
if broken_web_tests:
warn("Broken web tests suites found: %s" % ', '.join(broken_web_tests))
warn_user_about_tests()
complete_suite = unittest.TestSuite(test_modules)
res = unittest.TextTestRunner(verbosity=2).run(complete_suite)
return res.wasSuccessful()
class InvenioTestCase(unittest.TestCase):
"Invenio Test Case class."
pass
class InvenioWebTestCase(unittest.TestCase):
""" Helper library of useful web test functions
for web tests creation.
"""
def setUp(self):
"""Initialization before tests."""
## Let's default to English locale
profile = webdriver.FirefoxProfile()
profile.set_preference('intl.accept_languages', 'en-us, en')
profile.update_preferences()
# the instance of Firefox WebDriver is created
self.browser = webdriver.Firefox(profile)
# list of errors
self.errors = []
def tearDown(self):
"""Cleanup actions after tests."""
self.browser.quit()
self.assertEqual([], self.errors)
def find_element_by_name_with_timeout(self, element_name, timeout=30):
""" Find an element by name. This waits up to 'timeout' seconds
before throwing an InvenioWebTestCaseException or if it finds the
element will return it in 0 - timeout seconds.
@param element_name: name of the element to find
@type element_name: string
@param timeout: time in seconds before throwing an exception
if the element is not found
@type timeout: int
"""
try:
WebDriverWait(self.browser, timeout).until(lambda driver: driver.find_element_by_name(element_name))
except:
raise InvenioWebTestCaseException(element=element_name)
def find_element_by_link_text_with_timeout(self, element_link_text, timeout=30):
""" Find an element by link text. This waits up to 'timeout' seconds
before throwing an InvenioWebTestCaseException or if it finds the element
will return it in 0 - timeout seconds.
@param element_link_text: link text of the element to find
@type element_link_text: string
@param timeout: time in seconds before throwing an exception
if the element is not found
@type timeout: int
"""
try:
WebDriverWait(self.browser, timeout).until(lambda driver: driver.find_element_by_link_text(element_link_text))
except:
raise InvenioWebTestCaseException(element=element_link_text)
def find_element_by_partial_link_text_with_timeout(self, element_partial_link_text, timeout=30):
""" Find an element by partial link text. This waits up to 'timeout' seconds
before throwing an InvenioWebTestCaseException or if it finds the element
will return it in 0 - timeout seconds.
@param element_partial_link_text: partial link text of the element to find
@type element_partial_link_text: string
@param timeout: time in seconds before throwing an exception
if the element is not found
@type timeout: int
"""
try:
WebDriverWait(self.browser, timeout).until(lambda driver: driver.find_element_by_partial_link_text(element_partial_link_text))
except:
raise InvenioWebTestCaseException(element=element_partial_link_text)
def find_element_by_id_with_timeout(self, element_id, timeout=30, text=""):
""" Find an element by id. This waits up to 'timeout' seconds
before throwing an InvenioWebTestCaseException or if it finds the element
will return it in 0 - timeout seconds.
If the parameter text is provided, the function waits
until the element is found and its content is equal to the given text.
If the element's text is not equal to the given text an exception will be raised
and the result of this comparison will be stored in the errors list
#NOTE: Currently this is used to wait for an element's text to be
refreshed using JavaScript
@param element_id: id of the element to find
@type element_id: string
@param timeout: time in seconds before throwing an exception
if the element is not found
@type timeout: int
@param text: expected text inside the given element.
@type text: string
"""
try:
WebDriverWait(self.browser, timeout).until(lambda driver: driver.find_element_by_id(element_id))
except:
raise InvenioWebTestCaseException(element=element_id)
if text:
q = self.browser.find_element_by_id(element_id)
try:
# if the element's text is not equal to the given text, an exception will be raised
WebDriverWait(self.browser, timeout).until(lambda driver: driver.find_element_by_id(element_id) and q.text==text)
except:
# let's store the result of the comparison in the errors list
try:
self.assertEqual(q.text, text)
except AssertionError, e:
self.errors.append(str(e))
def find_element_by_xpath_with_timeout(self, element_xpath, timeout=30):
""" Find an element by xpath. This waits up to 'timeout' seconds
before throwing an InvenioWebTestCaseException or if it finds the element
will return it in 0 - timeout seconds.
@param element_xpath: xpath of the element to find
@type element_xpath: string
@param timeout: time in seconds before throwing an exception
if the element is not found
@type timeout: int
"""
try:
WebDriverWait(self.browser, timeout).until(lambda driver: driver.find_element_by_xpath(element_xpath))
except:
raise InvenioWebTestCaseException(element=element_xpath)
def find_elements_by_class_name_with_timeout(self, element_class_name, timeout=30):
""" Find an element by class name. This waits up to 'timeout' seconds
before throwing an InvenioWebTestCaseException or if it finds the element
will return it in 0 - timeout seconds.
@param element_class_name: class name of the element to find
@type element_class_name: string
@param timeout: time in seconds before throwing an exception
if the element is not found
@type timeout: int
"""
try:
WebDriverWait(self.browser, timeout).until(lambda driver: driver.find_element_by_class_name(element_class_name))
except:
raise InvenioWebTestCaseException(element=element_class_name)
def find_page_source_with_timeout(self, timeout=30):
""" Find the page source. This waits up to 'timeout' seconds
before throwing an InvenioWebTestCaseException
or if the page source is loaded will return it
in 0 - timeout seconds.
@param timeout: time in seconds before throwing an exception
if the page source is not found
@type timeout: int
"""
try:
WebDriverWait(self.browser, timeout).until(lambda driver: driver.page_source)
except:
raise InvenioWebTestCaseException(element="page source")
def login(self, username="guest", password="", force_ln='en', go_to_login_page=True):
""" Login function
@param username: the username (nickname or email)
@type username: string
@param password: the corresponding password
@type password: string
@param force_ln: if the arrival page doesn't use the corresponding
language, then the browser will redirect to it.
@type force_ln: string
@param go_to_login_page: if True, look for login link on the
page. Otherwise expect to be already
on a page with the login form
@type go_to_login_page: bool
"""
if go_to_login_page:
if not "You can use your nickname or your email address to login." in self.browser.page_source:
if "You are no longer recognized by our system" in self.browser.page_source:
self.find_element_by_link_text_with_timeout("login here")
self.browser.find_element_by_link_text("login here").click()
else:
self.find_element_by_link_text_with_timeout("login")
self.browser.find_element_by_link_text("login").click()
self.find_element_by_name_with_timeout("p_un")
self.browser.find_element_by_name("p_un").clear()
self.fill_textbox(textbox_name="p_un", text=username)
self.find_element_by_name_with_timeout("p_pw")
self.browser.find_element_by_name("p_pw").clear()
self.fill_textbox(textbox_name="p_pw", text=password)
self.find_element_by_name_with_timeout("action")
self.browser.find_element_by_name("action").click()
if force_ln and CFG_SITE_NAME_INTL[force_ln] not in self.browser.page_source:
splitted_url = list(urlsplit(self.browser.current_url))
query = cgi.parse_qs(splitted_url[3])
query.update({u'ln': unicode(force_ln)})
splitted_url[3] = urlencode(query)
new_url = urlunsplit(splitted_url)
self.browser.get(new_url)
def logout(self):
""" Logout function
"""
self.find_element_by_link_text_with_timeout("logout")
self.browser.find_element_by_link_text("logout").click()
@nottest
def element_value_test(self, element_name="", element_id="", \
expected_element_value="", unexpected_element_value="", in_form=True):
""" Function to check if the value in the given
element is the expected (unexpected) value or not
@param element_name: name of the corresponding element in the form
@type element_name: string
@param element_id: id of the corresponding element in the form
@type element_id: string
@param expected_element_value: the expected element value
@type expected_element_value: string
@param unexpected_element_value: the unexpected element value
@type unexpected_element_value: string
@param in_form: depending on this parameter, the value of the given element
is retrieved in a different way. If it is True, the given element is a textbox
or a textarea in a form.
@type in_form: boolean
"""
if element_name:
self.find_element_by_name_with_timeout(element_name)
q = self.browser.find_element_by_name(element_name)
elif element_id:
self.find_element_by_id_with_timeout(element_id)
q = self.browser.find_element_by_id(element_id)
if unexpected_element_value:
try:
if in_form:
self.assertNotEqual(q.get_attribute('value'), unexpected_element_value)
else:
self.assertNotEqual(q.text, unexpected_element_value)
except AssertionError, e:
self.errors.append(str(e))
if expected_element_value:
try:
if in_form:
self.assertEqual(q.get_attribute('value'), expected_element_value)
else:
self.assertEqual(q.text, expected_element_value)
except AssertionError, e:
self.errors.append(str(e))
@nottest
def page_source_test(self, expected_text="", unexpected_text=""):
""" Function to check if the current page contains
the expected text (unexpected text) or not.
The expected text (unexpected text) can also be
a link.
The expected text (unexpected text) can be a list of strings
in order to check multiple values inside same page
@param expected_text: the expected text
@type expected_text: string or list of strings
@param unexpected_text: the unexpected text
@type unexpected_text: string or list of strings
"""
self.find_page_source_with_timeout()
if unexpected_text:
if isinstance(unexpected_text, str):
unexpected_texts = [unexpected_text]
else:
unexpected_texts = unexpected_text
for unexpected_text in unexpected_texts:
try:
self.assertEqual(-1, self.browser.page_source.find(unexpected_text))
except AssertionError, e:
self.errors.append(str(e))
if expected_text:
if isinstance(expected_text, str):
expected_texts = [expected_text]
else:
expected_texts = expected_text
for expected_text in expected_texts:
try:
self.assertNotEqual(-1, self.browser.page_source.find(expected_text))
except AssertionError, e:
self.errors.append(str(e))
def choose_selectbox_option_by_label(self, selectbox_name="", selectbox_id="", label=""):
""" Select the option at the given label in
the corresponding select box
@param selectbox_name: the name of the corresponding
select box in the form
@type selectbox_name: string
@param selectbox_id: the id of the corresponding
select box in the form
@type selectbox_id: string
@param label: the option at this label will be selected
@type label: string
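Example (illustrative; the select box name and option label
are hypothetical):
self.choose_selectbox_option_by_label(selectbox_name="of",
label="HTML brief")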
"""
if selectbox_name:
self.find_element_by_name_with_timeout(selectbox_name)
selectbox = self.browser.find_element_by_name(selectbox_name)
elif selectbox_id:
self.find_element_by_id_with_timeout(selectbox_id)
selectbox = self.browser.find_element_by_id(selectbox_id)
options = selectbox.find_elements_by_tag_name("option")
for option in options:
if option.text == label:
option.click()
break
def choose_selectbox_option_by_index(self, selectbox_name="", selectbox_id="", index=""):
""" Select the option at the given index in
the corresponding select box
@param selectbox_name: the name of the corresponding
select box in the form
@type selectbox_name: string
@param selectbox_id: the id of the corresponding
select box in the form
@type selectbox_id: string
@param index: the option at this index will be selected
@type index: int
"""
if selectbox_name:
self.find_element_by_name_with_timeout(selectbox_name)
selectbox = self.browser.find_element_by_name(selectbox_name)
elif selectbox_id:
self.find_element_by_id_with_timeout(selectbox_id)
selectbox = self.browser.find_element_by_id(selectbox_id)
options = selectbox.find_elements_by_tag_name("option")
options[int(index)].click()
def choose_selectbox_option_by_value(self, selectbox_name="", selectbox_id="", value=""):
""" Select the option at the given value in
the corresponding select box
@param selectbox_name: the name of the corresponding
select box in the form
@type selectbox_name: string
@param selectbox_id: the id of the corresponding
select box in the form
@type selectbox_id: string
@param value: the option at this value will be selected
@type value: string
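Example (illustrative; unlike choose_selectbox_option_by_label(),
this matches the option's 'value' attribute rather than its
visible text; the names used here are hypothetical):
self.choose_selectbox_option_by_value(selectbox_name="of",
value="hb")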
"""
if selectbox_name:
self.find_element_by_name_with_timeout(selectbox_name)
selectbox = self.browser.find_element_by_name(selectbox_name)
elif selectbox_id:
self.find_element_by_id_with_timeout(selectbox_id)
selectbox = self.browser.find_element_by_id(selectbox_id)
options = selectbox.find_elements_by_tag_name("option")
for option in options:
if option.get_attribute('value') == value:
option.click()
break
def fill_textbox(self, textbox_name="", textbox_id="", text=""):
""" Fill in the input textbox or textarea with the given text
@param textbox_name: the name of the corresponding textbox
or text area in the form
@type textbox_name: string
@param textbox_id: the id of the corresponding textbox
or text area in the form
@type textbox_id: string
@param text: the information that the user wants to send
@type text: string
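Example (illustrative; the field name and search term are
hypothetical):
self.fill_textbox(textbox_name="p", text="ellis")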
"""
if textbox_name:
self.find_element_by_name_with_timeout(textbox_name)
textbox = self.browser.find_element_by_name(textbox_name)
elif textbox_id:
self.find_element_by_id_with_timeout(textbox_id)
textbox = self.browser.find_element_by_id(textbox_id)
textbox.send_keys(text)
def handle_popup_dialog(self):
""" Access the alert after triggering an action
that opens a popup. """
try:
alert = self.browser.switch_to_alert()
alert.accept()
except:
# No alert was present (or it could not be switched to);
# there is nothing to dismiss, so ignore the error.
pass
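# Typical usage (illustrative): call handle_popup_dialog() right
# after an action that may raise a JavaScript alert, e.g.:
# self.browser.find_element_by_name("delete").click()
# self.handle_popup_dialog()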
class InvenioWebTestCaseException(Exception):
"""This exception is thrown if the element
we are looking for is not found after a set time period.
The element is not found because the page needs more
time to be fully loaded. To avoid this exception,
we should increment the time period for that element in
the corresponding function. See also:
find_element_by_name_with_timeout()
find_element_by_link_text_with_timeout()
find_element_by_partial_link_text_with_timeout()
find_element_by_id_with_timeout()
find_element_by_xpath_with_timeout()
find_elements_by_class_name_with_timeout()
find_page_source_with_timeout()
"""
def __init__(self, element):
"""Initialisation."""
self.element = element
self.message = "Time for finding the element '%s' has expired" % self.element
def __str__(self):
"""String representation."""
return repr(self.message)
+
+
+class XmlTest(unittest.TestCase):
+ def assertXmlEqual(self, got, want):
+ """Assert that two XML documents are equivalent regardless
+ of formatting; on mismatch, print a unified diff and re-raise.
+ Assumes parseString (from xml.dom.minidom), difflib and
+ unittest are imported at the top of this module."""
+ # Pretty-print both documents and drop blank lines, so that
+ # only structural differences remain to be compared.
+ xml_lines = parseString(got).toprettyxml(encoding='utf-8').split('\n')
+ xml = '\n'.join(line for line in xml_lines if line.strip())
+ xml2_lines = parseString(want).toprettyxml(encoding='utf-8').split('\n')
+ xml2 = '\n'.join(line for line in xml2_lines if line.strip())
+ try:
+ self.assertEqual(xml, xml2)
+ except AssertionError:
+ # Print a line-by-line diff of the normalised documents
+ # before re-raising, to make failures easier to debug.
+ for line in difflib.unified_diff(xml.split('\n'), xml2.split('\n')):
+ print line.strip('\n')
+ raise
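+
+# A minimal usage sketch (hypothetical subclass; Python 2's minidom
+# writes attributes in sorted order when pretty-printing, so these
+# two documents compare equal):
+# class RecordXmlTest(XmlTest):
+#     def test_records_match(self):
+#         self.assertXmlEqual('<record id="1" type="a"/>',
+#                             '<record type="a" id="1"/>')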
