
refextract: introduce daemon operation mode

Description

  • Convert Refextract into a form that allows extraction tasks to be submitted via Bibtask to Bibsched, while preserving the independent nature of Refextract. (By default Refextract is now scheduled; when given a fulltext input with -f or --fulltext, it runs in the original standalone mode.)
  • Change how fulltext documents are supplied for extraction, to differentiate between running in standalone mode and running as a scheduled task: -f or --fulltext is now used to denote each individual fulltext document.
  • Add two intermediate files, 'refextract_cli.py' and 'refextract_daemon.py', which handle the execution mode of Refextract and the submission of a Refextract task to Bibsched.
  • Allow Refextract to run on specific collections and records, using the flags -c, --collection and -i, --recid.
  • Allow Refextract to construct a new scheduled extraction job from a predefined 'job configuration file', selected by the name of the job to run (using -e, --extraction-job). Each job corresponds to a job file of the same name under /etc/bibedit, which holds the parameters for the job.
  • Add functionality to interact with a new db table called xtrJOB, which holds the id, name, and last_updated information for each job that has been run (specified using -e or --extraction-job). The last_updated value is compared against the modification_date of each record; only records updated since the last run are scheduled to have their references re-extracted.
  • Include an extraction job file (refextract-job-preprints) to act as an example template.
  • Include in refextract_config a list of the job parameters that may be specified inside a Refextract job description file.
  • Change the '-s' flag, which controls the appearance of the journal standard reference form, to '-p' so that it does not clash with the sleep cli option of Bibsched.
  • Update Makefile.am to reflect the addition of refextract_daemon and refextract_cli files, and also the presence of the template extraction job file.
  • Set the refextract-specific default values for recids and collections in bibtask_config.py to empty lists. These are filled with the location of fulltext documents when starting Refextract inside Bibsched.
  • Handle all error messages regardless of the mode in which Refextract is running. (When Refextract is scheduled, short error messages are shown inside the Bibsched interface under the 'progress' column, and all messages are sent to the Bibsched log. When Refextract runs standalone, stdout or stderr is used. Stdout and stderr are also used when no xml file has been specified to hold the extracted references.)
  • Update the oai_harvest_daemon to call Refextract using the new fulltext flag (-f, --fulltext).
  • Include the default author kb location, used on ImportError.
  • Display an error message and halt when a user specifies an extraction-job to run together with other cli options, or combines a path to a fulltext document from which to extract with daemon-specific flags (--collection, --extraction-job).
  • Display the full directory in the error message when an extraction-job config file has not been found.
  • Show in --help the three main modes in which Refextract can be run.
  • Inside extraction-job files, accept either an absolute path or a base name when referencing report number and journal name knowledge bases. When only a base name is given, the daemon falls back to the Invenio 'etc' directory.
  • Add the xtrJOB table description to tabcreate.
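
The knowledge base path handling described for extraction-job files can be illustrated with a minimal sketch. This is not the code from this commit: the function name resolve_kb_path and the ETC_DIR value are hypothetical, and the sketch assumes only the behaviour stated above (absolute paths are used as given, base names are looked up under the Invenio 'etc' directory).

```python
import os

# Hypothetical stand-in for the Invenio 'etc' directory; the real
# location depends on the installation.
ETC_DIR = "/opt/invenio/etc"


def resolve_kb_path(kb_reference, etc_dir=ETC_DIR):
    """Return the path to use for a knowledge base named in a job file.

    Absolute paths are returned unchanged; anything else is treated as a
    base name and joined onto the 'etc' directory.
    """
    if os.path.isabs(kb_reference):
        return kb_reference
    return os.path.join(etc_dir, kb_reference)
```

With this sketch, a job file could reference either a full path such as "/tmp/journals.kb" or just "journals.kb", and both would resolve to a usable location.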

(closes #786)
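
The incremental selection described in the xtrJOB item can be sketched as follows. This is an illustration under stated assumptions, not the implementation from this commit: the function name records_to_reextract and the (recid, modification_date) tuple layout are hypothetical, no database access is shown, and treating a job with no last_updated value as "select every record" is an assumption.

```python
from datetime import datetime


def records_to_reextract(records, job_last_updated):
    """Pick the record ids whose references should be re-extracted.

    records: iterable of (recid, modification_date) tuples (hypothetical layout).
    job_last_updated: datetime of the job's last run, or None if the job
    has never run (assumed here to mean that every record is selected).
    """
    if job_last_updated is None:
        return [recid for recid, _ in records]
    # Keep only records modified after the job last ran.
    return [recid for recid, modified in records if modified > job_last_updated]


# Example data: three records and a job that last ran on 15 Aug 2011.
records = [
    (1, datetime(2011, 8, 1)),
    (2, datetime(2011, 8, 20)),
    (3, datetime(2011, 8, 23)),
]
selected = records_to_reextract(records, datetime(2011, 8, 15))
```

In this example only records 2 and 3 are selected, because record 1 was last modified before the job's previous run.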

Event Timeline

Christopher Hayward <christopher.james.hayward@cern.ch> committed R3600:85a4d09546e5: refextract: introduce daemon operation mode (authored by Christopher Hayward <christopher.james.hayward@cern.ch>). Aug 24 2011, 13:44