diff --git a/modules/bibupload/doc/admin/bibupload-admin-guide.webdoc b/modules/bibupload/doc/admin/bibupload-admin-guide.webdoc
index 84dee933c..4010166ba 100644
--- a/modules/bibupload/doc/admin/bibupload-admin-guide.webdoc
+++ b/modules/bibupload/doc/admin/bibupload-admin-guide.webdoc
@@ -1,772 +1,772 @@
## -*- mode: html; coding: utf-8; -*-
## This file is part of Invenio.
## Copyright (C) 2007, 2008, 2009, 2010, 2011, 2012, 2013 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.

Contents

1. Overview
2. Configuring BibUpload
3. Running BibUpload
       3.1. Inserting new records
       3.2. Inserting records into the Holding Pen
       3.3. Updating existing records
       3.4. Inserting and updating at the same time
       3.5. Updating preformatted output formats
       3.6. Uploading fulltext files
       3.7. Obtaining feedbacks
       3.8. Assigning additional informations to documents and other entities
             3.8.1 Uploading relations between documents
             3.8.2 Using temporary identifiers
4. Batch Uploader
       4.1. Web interface - Cataloguers
       4.2. Web interface - Robots
       4.2. Daemon mode

1. Overview

BibUpload enables you to upload bibliographic data in MARCXML format into the Invenio bibliographic database. It is also used internally by other Invenio modules as the sole entry point of metadata into the bibliographic databases.

Note that before uploading a MARCXML file, you may want to run the provided /opt/invenio/bin/xmlmarclint on it in order to verify its correctness.

2. Configuring BibUpload

BibUpload takes a MARCXML file as its input. There is nothing to be configured for these files. If the files have to be converted into MARCXML from some other format, structured or not, this is usually done beforehand via the BibConvert module.

Note that if you are using external system numbers for your records, such as when your records are being synchronized from an external system, then BibUpload knows about the tag 970 as the one containing the external system number. (To change this 970 tag into something else, you would have to edit the BibUpload config source file.)

Note also that BibUpload knows about OAI identifiers in a similar way, so that it will refuse to insert the same OAI harvested record twice, for example.

3. Running BibUpload

3.1 Inserting new records

Suppose that you have a MARCXML file containing new records that is to be uploaded into Invenio. (For example, it might have been produced by BibConvert.) To perform the upload, you would call the BibUpload script in the insert mode as follows:

 $ bibupload -i file.xml
 
 
In the insert mode, all the records from the file will be treated as new. This means that they should contain neither 001 tags (holding record IDs) nor 970 tags (holding external system numbers). BibUpload refuses to upload records having these tags, in order to prevent potential double uploading. If your file does contain 001 or 970, then chances are that you want to update existing records, not re-upload them as new, and so BibUpload will warn you about this and will refuse to continue.

For example, to insert a new record, your file should look like this:

     <record>
         <datafield tag="100" ind1=" " ind2=" ">
             <subfield code="a">Doe, John</subfield>
         </datafield>
         <datafield tag="245" ind1=" " ind2=" ">
             <subfield code="a">On The Foo And Bar</subfield>
         </datafield>
     </record>
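
An input record like the one above can also be generated programmatically. The following is a minimal sketch using only the Python standard library; the helper name make_record is hypothetical and is not part of Invenio.

```python
import xml.etree.ElementTree as ET


def make_record(fields):
    """Build a <record> element from (tag, code, value) triples.

    Blank indicators are used throughout, as in the example above.
    No 001 or 970 tag is added, so the record suits the insert mode.
    """
    record = ET.Element("record")
    for tag, code, value in fields:
        datafield = ET.SubElement(
            record, "datafield", {"tag": tag, "ind1": " ", "ind2": " "}
        )
        subfield = ET.SubElement(datafield, "subfield", {"code": code})
        subfield.text = value
    return record


record = make_record([("100", "a", "Doe, John"), ("245", "a", "On The Foo And Bar")])
xml_text = ET.tostring(record, encoding="unicode")
```

The resulting xml_text can be written to a file and passed to bibupload -i.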
 

3.2 Inserting records into the Holding Pen

A special mode of BibUpload that is tightly connected with BibEdit is the Holding Pen mode.

When you insert a record using the holding pen mode such as in the following example:

 $ bibupload -o file.xml
 
the records are not actually integrated into the database, but are instead put into an intermediate space called holding pen, where authorized curators can review them, manipulate them and eventually approve them.

The holding pen is integrated with BibEdit.

3.3 Updating existing records

When you want to update existing records, with the new content from your input MARCXML file, then your input file should contain either tags 001 (holding record IDs) or tag 970 (holding external system numbers). BibUpload will try to match existing records via 001 and 970 and if it finds a record in the database that corresponds to a record from the file, it will update its content. Otherwise it will signal an error saying that it could not find the record-to-be-updated.

For example, to update a title of record #123 via correct mode, your input file should contain record ID in the 001 tag and the title in 245 tag as follows:

     <record>
         <controlfield tag="001">123</controlfield>
         <datafield tag="245" ind1=" " ind2=" ">
             <subfield code="a">My Newly Updated Title</subfield>
         </datafield>
     </record>
 

There are several updating modes:

 
     -r, --replace Replace existing records by those from the XML
                   MARC file.  The original content is wiped out
                   and fully replaced.  Signals error if record
                   is not found via matching record IDs or system
                   numbers.
                   Fields defined in Invenio config variable
                   CFG_BIBUPLOAD_STRONG_TAGS are not replaced.
 
                   Note also that `-r' can be combined with `-i'
                   into an `-ir' option that would automatically
                   either insert records as new if they are not
                   found in the system, or correct existing
                   records if they are found to exist.
 
     -a, --append  Append fields from XML MARC file at the end of
                   existing records.  The original content is
                   enriched only.  Signals error if record is not
                   found via matching record IDs or system
                   numbers.
 
     -c, --correct Correct fields of existing records by those
                   from XML MARC file.  The original record
                   content is modified only on those fields from
                   the XML MARC file where both the tags and the
                   indicators match: the original fields are
                   removed and replaced by those from the XML
                   MARC file.  Fields not present in XML MARC
                   file are not changed (unlike the -r option).
                   Fields with "provenance" subfields defined in
                   'CFG_BIBUPLOAD_CONTROLLED_PROVENANCE_TAGS'
                   are protected against deletion unless the
                   input MARCXML contains a matching
                   provenance value.
                   Signals error if record is not found via
                   matching record IDs or system numbers.
 
     -d, --delete  Delete fields of existing records that are
                   contained in the XML MARC file. The fields in
                   the original record that are not present in
                   the XML MARC file are preserved.
                   This is incompatible with FFT (see below).
 

Note that if you are using the --replace mode and the incoming MARCXML specifies a 001 tag whose value is a record ID that does not exist, bibupload will not create the record on-the-fly unless the --force parameter was also passed on the command line. This is done in order to avoid creating, by mistake, holes in the database list of record identifiers. For example, if you --replace a non-existing record while imposing a record ID of, say, 1 000 000 and subsequently --insert a new record, the new record will automatically receive the ID 1 000 001.

If you combine the --pretend parameter with the above updating modes you can actually test what would be executed without modifying the database or altering the system status.

3.4 Inserting and updating at the same time

Note that the insert/update modes can be combined together. For example, if you have a file that contains a mixture of new records with possibly some records to be updated, then you can run:

 $ bibupload -i -r file.xml
 
 
In this case BibUpload will try to do an update (for records having either 001 or 970 identifiers), or an insert (for the other ones).
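
The decision described above can be previewed before running the upload. This is only an illustrative sketch, not BibUpload's own code: it assumes un-namespaced input (namespaced MARCXML would need qualified tag names), and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<collection>
  <record>
    <controlfield tag="001">123</controlfield>
    <datafield tag="245" ind1=" " ind2=" ">
      <subfield code="a">My Newly Updated Title</subfield>
    </datafield>
  </record>
  <record>
    <datafield tag="245" ind1=" " ind2=" ">
      <subfield code="a">On The Foo And Bar</subfield>
    </datafield>
  </record>
</collection>"""


def has_identifier(record):
    """True if the record carries a 001 controlfield or a 970 datafield."""
    if record.find("controlfield[@tag='001']") is not None:
        return True
    return record.find("datafield[@tag='970']") is not None


def split_by_action(xml_text):
    """Return (to_update, to_insert) lists of <record> elements."""
    root = ET.fromstring(xml_text)
    to_update, to_insert = [], []
    for record in root.iter("record"):
        (to_update if has_identifier(record) else to_insert).append(record)
    return to_update, to_insert


to_update, to_insert = split_by_action(SAMPLE)
```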

3.6 Uploading fulltext files

The fulltext files can be uploaded and revised via a special FFT ("fulltext file transfer") tag with the following semantic:

     FFT $a  ...  location of the docfile to upload (a filesystem path or a URL)
         $d  ...  docfile description (optional)
         $f  ...  format (optional; if not set, deduced from $a)
         $m  ...  new desired docfile name (optional; used for renaming files)
         $n  ...  docfile name (optional; if not set, deduced from $a)
         $o  ...  flag (repeatable subfield)
         $r  ...  restriction (optional, see below)
         $s  ...  set timestamp (optional, see below)
         $t  ...  docfile type (e.g. Main, Additional)
         $v  ...  version (used only with REVERT and DELETE-FILE, see below)
         $x  ...  url/path for an icon (optional)
         $z  ...  comment (optional)
         $w  ... MoreInfo modification of the document
         $p  ... MoreInfo modification of a current version of the document
         $b  ... MoreInfo modification of a current version and format of the document
         $u  ... MoreInfo modification of a format (of any version) of the document
 

For example, to upload a new fulltext file thesis.pdf associated to record ID 123:

     <record>
         <controlfield tag="001">123</controlfield>
         <datafield tag="FFT" ind1=" " ind2=" ">
             <subfield code="a">/tmp/thesis.pdf</subfield>
             <subfield code="t">Main</subfield>
             <subfield code="d">
               This is the fulltext version of my thesis in the PDF format.
               Chapter 5 still needs some revision.
             </subfield>
         </datafield>
     </record>
 

The FFT tag is repeatable, so one can pass along another FFT tag instance containing a pointer to e.g. the thesis defence slides. The subfields of an FFT tag are not repeatable, with the exception of $o.

When more than one FFT tag is specified for the same document (e.g. for adding more than one format at a time), if $t (docfile type), $m (new desired docfile name), $r (restriction), $v (version), $x (url/path for an icon), are specified, they should be identically specified for each single entry of FFT. E.g. if you want to specify an icon for a document with two formats (say .pdf and .doc), you'll write two FFT tags, both containing the same $x subfield.

The bibupload process, when it encounters FFT tags, will automatically populate fulltext storage space (/opt/invenio/var/data/files) and metadata record associated tables (bibrec_bibdoc, bibdoc) as appropriate. It will also enrich the 856 tags (URL tags) of the MARC metadata of the record in question with references to the latest versions of each file.

Note that for $a and $x subfields filesystem paths must be absolute (e.g. /tmp/icon.gif is valid, while Desktop/icon.gif is not) and they must be readable by the user/group of the bibupload process that will handle the FFT.
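
The path rule for $a and $x can be checked before submitting a file. The sketch below is an assumption about how such a pre-check might look, not something bibupload provides; it treats any value with a scheme and a host as a URL and otherwise requires a POSIX absolute path.

```python
import os
import posixpath
from urllib.parse import urlparse


def is_acceptable_location(value):
    """Accept URLs with a scheme and host, or absolute filesystem paths."""
    parsed = urlparse(value)
    if parsed.scheme and parsed.netloc:
        return True
    return posixpath.isabs(value)


def is_readable_here(path):
    """Check readability for the CURRENT process only.

    The bibupload process may run as a different user/group, so this is
    a hint rather than a guarantee.
    """
    return os.access(path, os.R_OK)
```

For instance, is_acceptable_location("/tmp/icon.gif") holds while is_acceptable_location("Desktop/icon.gif") does not.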

The bibupload process supports the usual modes correct, append, replace, insert with a semantic that is somewhat similar to the semantic of the metadata upload:

     objects being uploaded
         Metadata: MARC field instances characterized by tags (010-999)
         Fulltext: fulltext files characterized by unique file names (FFT $n)
     insert
         Metadata: insert new record; must not exist
         Fulltext: insert new files; must not exist
     append
         Metadata: append new tag instances for the given tag XXX, regardless of existing tag instances
         Fulltext: append new files, if filename (i.e. new format) not already present
     correct
         Metadata: correct tag instances for the given tag XXX; delete existing ones and replace with given ones
         Fulltext: correct files with the given filename; add new revision or delete file; if the docname does not exist the file is added
     replace
         Metadata: replace all tags, whatever XXX are
         Fulltext: replace all files, whatever filenames are
     delete
         Metadata: delete all existing tag instances
         Fulltext: not supported

Note that you can mix regular MARC tags with special FFT tags in the incoming XML input file. Both record metadata and record files will be updated as a result. Hence beware with some input modes, such as replace mode, if you would like to touch only files.

Note that in append and insert mode the $m is ignored.

In order to rename a document, use the correct mode, specifying in the $n subfield the original docname that should be renamed and in $m the new name.

Special values can be assigned to the $t subfield.

     PURGE
         In order to purge previous file revisions (i.e. in order to keep only the latest file version), please use the correct mode with $n docname and $t PURGE as the special keyword.
     DELETE
         In order to delete all existing versions of a file, making it effectively hidden, please use the correct mode with $n docname and $t DELETE as the special keyword.
     EXPUNGE
         In order to expunge (i.e. remove completely, also from the filesystem) all existing versions of a file, making it effectively disappear, please use the correct mode with $n docname and $t EXPUNGE as the special keyword.
     FIX-MARC
         In order to synchronize MARC to the bibrec/bibdoc structure (e.g. after an update or a tweak in the database), please use the correct mode with $n docname and $t FIX-MARC as the special keyword.
     FIX-ALL
         In order to fix a record (i.e. put all its linked documents in a coherent state) and synchronize the MARC to the table, please use the correct mode with $n docname and $t FIX-ALL as the special keyword.
     REVERT
         In order to revert to a previous file revision (i.e. to create a new revision with the same content as some previous revision had), please use the correct mode with $n docname, $t REVERT as the special keyword and $v the number corresponding to the desired version.
     DELETE-FILE
         In order to delete a particular file added by mistake, please use the correct mode with $n docname, $t DELETE-FILE, specifying $v version and $f format. Note that this operation is not reversible. Note that if you don't specify a version, the last version will be used.

In order to preserve previous comments and descriptions when correcting, please use the KEEP-OLD-VALUE special keyword with the desired $d and $z subfield.

The $r subfield can contain a string that is used to restrict the given document. The same value must be specified for all the formats of a given document. By default the keyword is used as the status parameter of the "viewrestrdoc" action, which can be used to grant access to the desired users. For example, if you set the keyword "thesis", you can then connect the "thesisviewer" role to the action "viewrestrdoc" with the parameter "status" set to "thesis". All the users linked with the "thesisviewer" role will then be able to download the document, while any other user who is not considered an author of the given record will not. Note that if you use the keyword "KEEP-OLD-VALUE", the previous restrictions, if any, are kept.

More advanced document-level restriction is indeed possible. If the value in fact contains:

Note that authors (as defined in the record MARC) and superadmin are always authorized to access a document, regardless of the status value.

Some special flags might be set via FFT and associated with the current document by using the $o subfield. This feature is experimental. Currently only two flags are actively considered:

Note that each time bibupload is called on a record, the 8564 tags pointing to locally stored files are recreated on the basis of the full-text files connected to the record. Thus, if you wish to update some 8564 tag pointing to a locally managed file, the only way to do so is through the FFT tag, not by editing 8564 directly.

The subfield $s of FFT can be used to set time stamp of the uploaded file to a given value, e.g. 2007-05-04 03:02:01. This is useful when uploading old files. When $s is not present, the current time will be used.

3.7 Obtaining feedbacks

Sometimes, to implement a particular workflow or policy in a digital repository, it is useful to receive automatic, machine-friendly feedback that acknowledges the outcome of a bibupload execution. To this end the --callback-url command line parameter can be used. This parameter expects a URL to which a JSON-serialized response will be POSTed.

Say, you have an external service reachable via the URL http://www.example.org/accept_feedback. If the argument:

 --callback-url http://www.example.org/accept_feedback
 
is added to the usual bibupload call, then at the end of the execution of the corresponding bibupload task an HTTP POST request will be performed, if possible, to the given URL, reporting the outcome of the bibupload execution as a JSON-serialized response with the following structure:

For example, a possible JSON response posted to a specified URL can look like:

 {
     "results": [
         {
             "recid": -1,
             "error_message": "ERROR: can not retrieve the record identifier",
             "success": false
         },
         {
             "recid": 1000,
             "error_message": "",
             "success": true,
             "marcxml": "1000...",
             "url": "http://www.example.org/record/1000"
         },
         ...
     ]
 }
 

Note that, currently, in case the specified URL can not be reached at the time of the POST request, the whole bibupload task will fail.
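
On the receiving side, the posted body can be reduced to the record IDs that were uploaded and the error messages of those that failed. The following sketch assumes only the field names shown in the example above; the function name and the sample payload are illustrative.

```python
import json

payload = """{
    "results": [
        {"recid": -1,
         "error_message": "ERROR: can not retrieve the record identifier",
         "success": false},
        {"recid": 1000,
         "error_message": "",
         "success": true,
         "url": "http://www.example.org/record/1000"}
    ]
}"""


def summarize(body):
    """Split a callback body into uploaded recids and error messages."""
    results = json.loads(body)["results"]
    uploaded = [r["recid"] for r in results if r["success"]]
    errors = [r["error_message"] for r in results if not r["success"]]
    return uploaded, errors


uploaded, errors = summarize(payload)
```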

If you use the same callback URL to receive the feedback from more than one bibupload request you might want to be able to correctly identify each bibupload call with the corresponding feedback. For this reason you can pass to the bibupload call an additional argument:

 --nonce VALUE
 
where VALUE can be any string you wish. This string will then be added to the JSON structure, as in (supposing you specified --nonce 1234):
 {
     "nonce": "1234",
     "results": [
         {
             "recid": -1,
             "error_message": "ERROR: can not retrieve the record identifier",
             "success": false
         },
         {
             "recid": 1000,
             "error_message": "",
             "success": true,
             "marcxml": "1000...",
             "url": "http://www.example.org/record/1000"
         },
         ...
     ]
 }
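
A receiver shared by several uploads could use the nonce to find out which submission a given feedback belongs to. This is a sketch under the assumption that the caller keeps its own table of outstanding nonces; the names pending and match_feedback are hypothetical.

```python
import json

# nonce -> label chosen by the caller for the submitted upload (hypothetical)
pending = {"1234": "file.xml"}


def match_feedback(body, pending):
    """Return the label registered for the nonce in a callback body.

    The entry is removed once matched; None is returned when the body
    has no nonce or the nonce is unknown.
    """
    feedback = json.loads(body)
    return pending.pop(feedback.get("nonce"), None)


label = match_feedback('{"nonce": "1234", "results": []}', pending)
```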
 

3.8 Assigning additional informations to documents and other entities

Some bits of metadata should neither be viewed directly by Invenio users nor stored in the MARC format. This includes all types of non-standard data related to records and documents, for example flags related to documents (specified inside an FFT tag) or bits of semantic information related to entities managed in Invenio. This type of data is usually machine generated and is meant to be used internally by Invenio modules.

Invenio provides a general mechanism for storing objects related to different Invenio entities. This mechanism is called MoreInfo and resembles well-known more-info solutions. Every entity (document, version of a document, format of a particular version of a document, relation between documents) can be assigned a dictionary of arbitrary values. The dictionary is divided into namespaces, which separate data coming from different modules and serving different purposes.

BibUpload, the only gateway for uploading data into the Invenio database, can populate MoreInfo structures. The MoreInfo related to a given entity can be modified by providing a pickle-serialised, base64-encoded Python object with the following structure:

 {
     "namespace": {
         "key": "value",
         "key2": "value2"
     }
 }
 

For example the above dictionary should be uploaded as

KGRwMQpTJ25hbWVzcGFjZScKcDIKKGRwMwpTJ2tleTInCnA0ClMndmFsdWUyJwpwNQpzUydrZXknCnA2ClMndmFsdWUnCnA3CnNzLg==

which is the base64-encoded representation of the string

(dp1\nS'namespace'\np2\n(dp3\nS'key2'\np4\nS'value2'\np5\nsS'key'\np6\nS'value'\np7\nss.

A key can be removed from a dictionary by providing None as its value. Empty namespaces are considered non-existent.

The string representation of modifications to the MoreInfo dictionary can be provided in several places, depending on the object to which it should be attached. The most general upload method, the BDM tag, has the following semantics:
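
The encoding step can be sketched with the Python standard library as follows. Note the assumptions: this runs on Python 3, whose protocol-0 pickles differ byte for byte from the string shown above (which has the shape of a Python 2 pickle), so treat the encoder as an illustration of the scheme rather than a guaranteed-compatible producer. The function names are hypothetical.

```python
import base64
import pickle


def encode_moreinfo(update):
    """Pickle a {namespace: {key: value}} dict and base64-encode it."""
    return base64.b64encode(pickle.dumps(update, protocol=0)).decode("ascii")


def decode_moreinfo(encoded):
    """Inverse of encode_moreinfo; unpickle only input you trust."""
    return pickle.loads(base64.b64decode(encoded))


update = {"namespace": {"key": "value", "key2": "value2"}}
encoded = encode_moreinfo(update)

# The base64 string quoted in this guide, split in two for readability.
documented = (
    "KGRwMQpTJ25hbWVzcGFjZScKcDIKKGRwMwpTJ2tleTInCnA0ClMndmFsdWUyJwpw"
    "NQpzUydrZXknCnA2ClMndmFsdWUnCnA3CnNzLg=="
)
```

Decoding either encoded or documented yields the same dictionary, which is a quick way to confirm what a given $m value contains.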

     BDM $r  ... Identifier of a relation between documents (optional)
         $i  ... Identifier of a BibDoc (optional)
         $v  ... Version of a BibDoc (optional)
         $n  ... Name of a BibDoc (within a current record) (optional)
         $f  ... Format of a BibDoc (optional)
         $m  ... Serialised update to the MoreInfo dictionary
 

All subfields except $m are optional; they identify the entity to which the MoreInfo should refer.

Besides the BDM tag, MoreInfo can be transferred using special subfields of the FFT and BDR tags. The former modify the MoreInfo of a newly uploaded document, the latter that of a relation. The additional subfields have the following semantics:

     FFT $w  ... MoreInfo modification of the document
         $p  ... MoreInfo modification of a current version of the document
         $b  ... MoreInfo modification of a current version and format of the document
         $u  ... MoreInfo modification of a format (of any version) of the document
     BDR $m  ... MoreInfo modification of a relation between BibDocs
 

3.8.1 Uploading relations between documents

Relations between documents are another kind of non-MARC data that can be uploaded to Invenio. Like MoreInfo, relations are intended to be used by Invenio modules. The BDR field, which is used to upload relations, has the following semantics:

     BDR $r  ... Identifier of the relation (optional, can be provided if modifying a known relation)
 
         $i  ... Identifier of the first document
         $n  ... Name of the first document (within the current record) (optional)
         $v  ... Version of the first document (optional)
         $f  ... Format of the first document (optional)
 
         $j  ... Identifier of the second document
         $o  ... Name of the second document (within the current record) (optional)
         $w  ... Version of the second document (optional)
         $g  ... Format of the second document (optional)
 
         $t  ... Type of the relation
         $m  ... Modification of the MoreInfo of the relation
         $d  ... Special field. if value=DELETE, relation is removed
 

Behaviour of the BDR tag in different upload modes:

     insert, append
         Inserts a new relation if necessary. Appends fields to the MoreInfo structure.
     correct, replace
         Creates a new relation if necessary. Replaces the entire content of the MoreInfo field.

3.8.2 Using temporary identifiers

In many cases, users want to upload large collections of documents using a single BibUpload task. The infrastructure described in the rest of this manual allows easy upload of multiple documents, but lacks facilities for relating them to each other. A sample use case that cannot be satisfied by simple usage of FFT tags is uploading a document and relating it to another which is either already in the database or is being uploaded within the same BibUpload task. BibUpload provides a mechanism of temporary identifiers which serves scenarios such as this one.

A temporary identifier is a string (unique in the context of a single MARC XML document) which replaces a document number or a version number. In the context of BibDoc manipulations (FFT, BDR and BDM tags), temporary identifiers can appear wherever a version or a numerical ID is required. If a temporary identifier appears in the context of a document that already has an ID assigned, it is interpreted as that existing number. If a newly created document is assigned a temporary identifier, the newly generated numerical ID is bound to the temporary identifier. In order to be recognised as a temporary identifier, a string has to begin with the prefix TMP:. The mechanism of temporary identifiers cannot be used in the context of records, but only with BibDocs.

A BibUpload input using temporary identifiers can look like:

 
 <collection xmlns="http://www.loc.gov/MARC21/slim">
   <record>
     <datafield tag="100" ind1=" " ind2=" ">
       <subfield code="a">This is a record of the publication</subfield>
     </datafield>
     <datafield tag="FFT" ind1=" " ind2=" ">
       <subfield code="a">http://somedomain.com/document.pdf</subfield>
       <subfield code="t">Main</subfield>
       <subfield code="n">docname</subfield>
       <subfield code="i">TMP:id_identifier1</subfield>
       <subfield code="v">TMP:ver_identifier1</subfield>
     </datafield>
   </record>
 
   <record>
     <datafield tag="100" ind1=" " ind2=" ">
       <subfield code="a">This is a record of a dataset extracted from the publication</subfield>
     </datafield>
 
     <datafield tag="FFT" ind1=" " ind2=" ">
       <subfield code="a">http://sample.com/dataset.data</subfield>
       <subfield code="t">Main</subfield>
       <subfield code="n">docname2</subfield>
       <subfield code="i">TMP:id_identifier2</subfield>
       <subfield code="v">TMP:ver_identifier2</subfield>
     </datafield>
 
     <datafield tag="BDR" ind1=" " ind2=" ">
       <subfield code="i">TMP:id_identifier1</subfield>
       <subfield code="v">TMP:ver_identifier1</subfield>
       <subfield code="j">TMP:id_identifier2</subfield>
       <subfield code="w">TMP:ver_identifier2</subfield>
 
       <subfield code="t">is_extracted_from</subfield>
     </datafield>
   </record>
 
 </collection>
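
Before submitting such a file it can be useful to check that every temporary identifier referenced in a BDR tag is also introduced by some FFT tag in the same file. The sketch below is a hypothetical lint, not part of BibUpload, and it assumes un-namespaced input and that all related documents are uploaded within the same file.

```python
import xml.etree.ElementTree as ET

TMP_PREFIX = "TMP:"

SAMPLE = """<collection>
  <record>
    <datafield tag="FFT" ind1=" " ind2=" ">
      <subfield code="i">TMP:id_identifier1</subfield>
    </datafield>
    <datafield tag="BDR" ind1=" " ind2=" ">
      <subfield code="i">TMP:id_identifier1</subfield>
      <subfield code="j">TMP:id_identifier2</subfield>
    </datafield>
  </record>
</collection>"""


def is_temporary(value):
    """True if the value uses the TMP: prefix."""
    return value.startswith(TMP_PREFIX)


def undefined_temporary_ids(xml_text):
    """Temporary ids used in BDR tags but never introduced by an FFT tag."""
    root = ET.fromstring(xml_text)
    defined, used = set(), set()
    for datafield in root.iter("datafield"):
        values = {
            sf.text
            for sf in datafield.iter("subfield")
            if sf.text and is_temporary(sf.text)
        }
        if datafield.get("tag") == "FFT":
            defined |= values
        elif datafield.get("tag") == "BDR":
            used |= values
    return used - defined


missing = undefined_temporary_ids(SAMPLE)
```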
 

4. Batch Uploader

4.1 Web interface - Cataloguers

The batchuploader web interface can be used to upload either metadata files or documents. As opposed to daemon mode, actions will be executed only once.

The available upload history displays metadata and document uploads using the web interface, not daemon mode.

4.2 Web interface - Robots

If you need to use the batch upload function from the command line, this can be achieved with a curl call such as:

 $ curl -F 'file=@localfile.xml' -F 'mode=-i' http://cds.cern.ch/batchuploader/robotupload [-F 'callback_url=http://...'] -A invenio_webupload
 
 

This service performs (client, file) checking to ensure that the records are put into a collection the client has rights to.
To configure these permissions, check the CFG_BATCHUPLOADER_WEB_ROBOT_RIGHTS variable in the configuration file.
The allowed user agents can also be defined using the CFG_BATCHUPLOADER_WEB_ROBOT_AGENTS variable.

Note that you can receive machine-friendly feedback from the bibupload task launched by a given batchuploader request by adding the optional POST field callback_url, which has the same semantics as the --callback-url command line parameter of bibupload (see the previous paragraph Obtaining feedbacks).

A second more RESTful interface is also available: it will suffice to append to the URL the specific mode (among "insert",
-"append", "correct", "delete", "replace"), as in:
+"append", "correct", "delete", "replace", "insertorreplace"), as in:

 http://cds.cern.ch/batchuploader/robotupload/insert
 

The callback_url argument can be put in query part of the URL as in:

 http://cds.cern.ch/batchuploader/robotupload/insert?callback_url=http://myhandler
 

In case the HTTP server that is going to receive the feedback at callback_url expects the request to be encoded in application/x-www-form-urlencoded rather than application/json (e.g. if the server is implemented directly in Oracle), you can further specify the special_treatment argument and set it to oracle. The feedback will then be further encoded into an application/x-www-form-urlencoded request, with a single form key called results, which will contain the final JSON data.
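
A receiver of the form-encoded variant has to unwrap the results key before parsing the JSON. The sketch below simulates such a body locally; whether the wrapped JSON has exactly the shape used here is an assumption, and decode_form_feedback is a hypothetical name.

```python
import json
from urllib.parse import parse_qs, urlencode


def decode_form_feedback(body):
    """Extract the JSON feedback from a form-encoded body with a 'results' key."""
    form = parse_qs(body)
    return json.loads(form["results"][0])


# Simulated request body, built here only to exercise the decoder.
feedback = {"results": [{"recid": 1000, "error_message": "", "success": True}]}
body = urlencode({"results": json.dumps(feedback)})
decoded = decode_form_feedback(body)
```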

The MARCXML content should then be specified as the body of the request. With curl this can be implemented as in:

 $ curl -T localfile.xml http://cds.cern.ch/batchuploader/robotupload/insert?callback_url=http://... -A invenio_webupload -H "Content-Type: application/marcxml+xml"
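
The same request can be prepared from Python. This is a sketch under stated assumptions: curl -T performs an HTTP PUT, so the request below uses PUT as well; the helper name is hypothetical; and the request is only built, not sent.

```python
import urllib.request
from urllib.parse import urlencode


def build_robotupload_request(base_url, mode, marcxml, callback_url=None, nonce=None):
    """Prepare (but do not send) a PUT request carrying MARCXML in its body."""
    params = {}
    if nonce is not None:
        params["nonce"] = nonce
    if callback_url is not None:
        params["callback_url"] = callback_url
    url = "%s/%s" % (base_url.rstrip("/"), mode)
    if params:
        # urlencode percent-escapes the callback URL so that it survives
        # as a single query argument.
        url += "?" + urlencode(params)
    return urllib.request.Request(
        url,
        data=marcxml.encode("utf-8"),
        method="PUT",
        headers={
            "Content-Type": "application/marcxml+xml",
            "User-Agent": "invenio_webupload",
        },
    )


request = build_robotupload_request(
    "http://cds.cern.ch/batchuploader/robotupload",
    "insert",
    "<record></record>",
    callback_url="http://www.example.org/accept_feedback",
)
```

Passing the request to urllib.request.urlopen would perform the actual upload.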
 

The nonce argument that can be passed to BibUpload as described in the previous paragraph can also be specified with both robotupload interfaces. E.g.:

 $ curl -F 'file=@localfile.xml' -F 'nonce=1234' -F 'mode=-i' http://cds.cern.ch/batchuploader/robotupload -F 'callback_url=http://...' -A invenio_webupload
 
and
 $ curl -T localfile.xml 'http://cds.cern.ch/batchuploader/robotupload/insert?nonce=1234&callback_url=http://...' -A invenio_webupload -H "Content-Type: application/marcxml+xml"
 

4.2 Daemon mode

The batchuploader daemon mode is intended to be run as a bibsched task for document or metadata upload. The parent directory in which the daemon looks for the metadata and documents folders must be specified in the Invenio configuration file.

Assuming that Invenio was installed in /opt/invenio, the directories should be arranged as in the following example:

      /opt/invenio/var/batchupload
             /opt/invenio/var/batchupload/documents
                     /opt/invenio/var/batchupload/documents/append
                     /opt/invenio/var/batchupload/documents/revise
             /opt/invenio/var/batchupload/metadata
                     /opt/invenio/var/batchupload/metadata/append
                     /opt/invenio/var/batchupload/metadata/correct
                     /opt/invenio/var/batchupload/metadata/insert
                     /opt/invenio/var/batchupload/metadata/replace
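
The layout above can be created with a few lines of Python. In this sketch a temporary directory stands in for /opt/invenio/var/batchupload, and create_layout is a hypothetical helper rather than an Invenio function.

```python
import os
import tempfile

LAYOUT = {
    "documents": ["append", "revise"],
    "metadata": ["append", "correct", "insert", "replace"],
}


def create_layout(parent):
    """Create the documents/ and metadata/ subfolders under parent."""
    created = []
    for kind, modes in sorted(LAYOUT.items()):
        for mode in modes:
            path = os.path.join(parent, kind, mode)
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created


# A temporary directory stands in for /opt/invenio/var/batchupload.
parent = tempfile.mkdtemp()
created = create_layout(parent)
```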
 

When running the batchuploader daemon there are two possible execution modes:

         -m,   --metadata    Look for metadata files in folders insert, append, correct and replace.
                             All files are uploaded and then moved to the corresponding DONE folder.
         -d,   --documents   Look for documents in folders append and revise. Uploaded files are then
                             moved to DONE folders if possible.
 
By default, metadata mode is used.

An example of invocation would be:

 $ batchuploader --documents
 
 

It is possible to program batch uploader to run periodically. Read the Howto-run guide to see how.

diff --git a/modules/bibupload/lib/batchuploader_engine.py b/modules/bibupload/lib/batchuploader_engine.py
index 7a70b00c9..41daee9b6 100644
--- a/modules/bibupload/lib/batchuploader_engine.py
+++ b/modules/bibupload/lib/batchuploader_engine.py
@@ -1,685 +1,687 @@
 # -*- coding: utf-8 -*-
 ##
 ## This file is part of Invenio.
 ## Copyright (C) 2010, 2011, 2012, 2013 CERN.
 ##
 ## Invenio is free software; you can redistribute it and/or
 ## modify it under the terms of the GNU General Public License as
 ## published by the Free Software Foundation; either version 2 of the
 ## License, or (at your option) any later version.
 ##
 ## Invenio is distributed in the hope that it will be useful, but
 ## WITHOUT ANY WARRANTY; without even the implied warranty of
 ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
 ## General Public License for more details.
 ##
 ## You should have received a copy of the GNU General Public License
 ## along with Invenio; if not, write to the Free Software Foundation, Inc.,
 ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.

 """
 Batch Uploader core functions. Uploading metadata and documents.
 """

 import os
 import pwd
 import grp
 import sys
 import time
 import tempfile
 import cgi
 import re

 from invenio.dbquery import run_sql, Error
 from invenio.access_control_engine import acc_authorize_action
 from invenio.webuser import collect_user_info, page_not_authorized
 from invenio.config import CFG_BINDIR, CFG_TMPSHAREDDIR, CFG_LOGDIR, \
                             CFG_BIBUPLOAD_EXTERNAL_SYSNO_TAG, \
                             CFG_BIBUPLOAD_EXTERNAL_OAIID_TAG, \
                             CFG_OAI_ID_FIELD, CFG_BATCHUPLOADER_DAEMON_DIR, \
                             CFG_BATCHUPLOADER_WEB_ROBOT_RIGHTS, \
                             CFG_BATCHUPLOADER_WEB_ROBOT_AGENTS, \
                             CFG_PREFIX, CFG_SITE_LANG
 from invenio.textutils import encode_for_xml
 from invenio.bibtask import task_low_level_submission
 from invenio.messages import gettext_set_language
 from invenio.textmarc2xmlmarc import transform_file
 from invenio.shellutils import run_shell_command
 from invenio.bibupload import xml_marc_to_records, bibupload
 from invenio.access_control_firerole import _ip_matcher_builder, _ipmatch

 import invenio.bibupload as bibupload_module

 from invenio.bibrecord import create_records, \
                               record_strip_empty_volatile_subfields, \
                               record_strip_empty_fields

 try:
     from cStringIO import StringIO
 except ImportError:
     from StringIO import StringIO

 PERMITTED_MODES = ['-i', '-r', '-c', '-a', '-ir',
                    '--insert', '--replace', '--correct', '--append']

 _CFG_BATCHUPLOADER_WEB_ROBOT_AGENTS_RE = re.compile(CFG_BATCHUPLOADER_WEB_ROBOT_AGENTS)

 _CFG_BATCHUPLOADER_WEB_ROBOT_RIGHTS = []
 for _network, _collection in CFG_BATCHUPLOADER_WEB_ROBOT_RIGHTS.items():
     if '/' not in _network:
         _network += '/32'
     _CFG_BATCHUPLOADER_WEB_ROBOT_RIGHTS.append((_ip_matcher_builder(_network), _collection))
     del _network
     del _collection

 def cli_allocate_record(req):
     req.content_type = "text/plain"
     req.send_http_header()

     # check IP and useragent:
     if not _get_client_authorized_collections(_get_client_ip(req)):
         msg = "[ERROR] Sorry, client IP %s cannot use the service." % _get_client_ip(req)
         _log(msg)
         return _write(req, msg)
     if not _check_client_useragent(req):
         msg = '[ERROR] Sorry, the "%s" useragent cannot use the service.' % _get_useragent(req)
         _log(msg)
         return _write(req, msg)

     recid = run_sql("insert into bibrec (creation_date,modification_date) values(NOW(),NOW())")
     return recid

 def cli_upload(req, file_content=None, mode=None, callback_url=None, nonce=None, special_treatment=None):
     """ Robot interface for uploading MARC files """
     req.content_type = "text/plain"
     req.send_http_header()

     # check IP and useragent:
     if not _get_client_authorized_collections(_get_client_ip(req)):
         msg = "[ERROR] Sorry, client IP %s cannot use the service." % _get_client_ip(req)
         _log(msg)
         return _write(req, msg)
     if not _check_client_useragent(req):
         msg = "[ERROR] Sorry, the %s useragent cannot use the service." % _get_useragent(req)
         _log(msg)
         return _write(req, msg)

     arg_mode = mode
     if not arg_mode:
         msg = "[ERROR] Please specify upload mode to use."
         _log(msg)
         return _write(req, msg)
+    if arg_mode == '--insertorreplace':
+        arg_mode = '-ir'
     if not arg_mode in PERMITTED_MODES:
         msg = "[ERROR] Invalid upload mode."
         _log(msg)
         return _write(req, msg)

     arg_file = file_content
     if hasattr(arg_file, 'read'):
         ## We've been passed a readable file, e.g. req
         arg_file = arg_file.read()
         if not arg_file:
             msg = "[ERROR] Please provide a body to your request."
             _log(msg)
             return _write(req, msg)
     else:
         if not arg_file:
             msg = "[ERROR] Please specify file body to input."
_log(msg) return _write(req, msg) if hasattr(arg_file, "filename"): arg_file = arg_file.value else: msg = "[ERROR] 'file' parameter must be a (single) file" _log(msg) return _write(req, msg) # write temporary file: (fd, filename) = tempfile.mkstemp(prefix="batchupload_" + \ time.strftime("%Y%m%d%H%M%S", time.localtime()) + "_", dir=CFG_TMPSHAREDDIR) filedesc = os.fdopen(fd, 'w') filedesc.write(arg_file) filedesc.close() # check if this client can run this file: client_ip = _get_client_ip(req) permitted_dbcollids = _get_client_authorized_collections(client_ip) if '*' not in permitted_dbcollids: # wildcard allow = _check_client_can_submit_file(client_ip, filename, req, 0) if not allow: msg = "[ERROR] Cannot submit such a file from this IP. (Wrong collection.)" _log(msg) return _write(req, msg) # check validity of marcxml xmlmarclint_path = CFG_BINDIR + '/xmlmarclint' xmlmarclint_output, dummy1, dummy2 = run_shell_command('%s %s' % (xmlmarclint_path, filename)) if xmlmarclint_output != 0: msg = "[ERROR] MARCXML is not valid." _log(msg) return _write(req, msg) args = ['bibupload', "batchupload", arg_mode, filename] # run upload command if callback_url: args += ["--callback-url", callback_url] if nonce: args += ["--nonce", nonce] if special_treatment: args += ["--special-treatment", special_treatment] task_low_level_submission(*args) msg = "[INFO] %s" % ' '.join(args) _log(msg) return _write(req, msg) def metadata_upload(req, metafile=None, filetype=None, mode=None, exec_date=None, exec_time=None, metafilename=None, ln=CFG_SITE_LANG, priority="1", email_logs_to=None): """ Metadata web upload service. Get upload parameters and exec bibupload for the given file. Finally, write upload history. 
@return: tuple (error code, message) error code: code that indicates if an error ocurred message: message describing the error """ # start output: req.content_type = "text/html" req.send_http_header() error_codes = {'not_authorized': 1} user_info = collect_user_info(req) (fd, filename) = tempfile.mkstemp(prefix="batchupload_" + \ user_info['nickname'] + "_" + time.strftime("%Y%m%d%H%M%S", time.localtime()) + "_", dir=CFG_TMPSHAREDDIR) filedesc = os.fdopen(fd, 'w') filedesc.write(metafile) filedesc.close() # check if this client can run this file: if req is not None: allow = _check_client_can_submit_file(req=req, metafile=metafile, webupload=1, ln=ln) if allow[0] != 0: return (error_codes['not_authorized'], allow[1]) # run upload command: task_arguments = ('bibupload', user_info['nickname'], mode, "--priority=" + priority, "-N", "batchupload") if exec_date: date = exec_date if exec_time: date += ' ' + exec_time task_arguments += ("-t", date) if email_logs_to: task_arguments += ('--email-logs-to', email_logs_to) task_arguments += (filename, ) jobid = task_low_level_submission(*task_arguments) # write batch upload history run_sql("""INSERT INTO hstBATCHUPLOAD (user, submitdate, filename, execdate, id_schTASK, batch_mode) VALUES (%s, NOW(), %s, %s, %s, "metadata")""", (user_info['nickname'], metafilename, exec_date != "" and (exec_date + ' ' + exec_time) or time.strftime("%Y-%m-%d %H:%M:%S"), str(jobid), )) return (0, "Task %s queued" % str(jobid)) def document_upload(req=None, folder="", matching="", mode="", exec_date="", exec_time="", ln=CFG_SITE_LANG, priority="1", email_logs_to=None): """ Take files from the given directory and upload them with the appropiate mode. @parameters: + folder: Folder where the files to upload are stored + matching: How to match file names with record fields (report number, barcode,...) 
+ mode: Upload mode (append, revise, replace) @return: tuple (file, error code) file: file name causing the error to notify the user error code: 1 - More than one possible recID, ambiguous behaviour 2 - No records match that file name 3 - File already exists """ import sys if sys.hexversion < 0x2060000: from md5 import md5 else: from hashlib import md5 from invenio.bibdocfile import BibRecDocs, file_strip_ext import shutil from invenio.search_engine import perform_request_search, \ search_pattern, \ guess_collection_of_a_record _ = gettext_set_language(ln) errors = [] info = [0, []] # Number of files read, name of the files try: files = os.listdir(folder) except OSError, error: errors.append(("", error)) return errors, info err_desc = {1: _("More than one possible recID, ambiguous behaviour"), 2: _("No records match that file name"), 3: _("File already exists"), 4: _("A file with the same name and format already exists"), 5: _("No rights to upload to collection '%s'")} # Create directory DONE/ if doesn't exist folder = (folder[-1] == "/") and folder or (folder + "/") files_done_dir = folder + "DONE/" try: os.mkdir(files_done_dir) except OSError: # Directory exists or no write permission pass for docfile in files: if os.path.isfile(os.path.join(folder, docfile)): info[0] += 1 identifier = file_strip_ext(docfile) extension = docfile[len(identifier):] rec_id = None if identifier: rec_id = search_pattern(p=identifier, f=matching, m='e') if not rec_id: errors.append((docfile, err_desc[2])) continue elif len(rec_id) > 1: errors.append((docfile, err_desc[1])) continue else: rec_id = str(list(rec_id)[0]) rec_info = BibRecDocs(rec_id) if rec_info.bibdocs: for bibdoc in rec_info.bibdocs: attached_files = bibdoc.list_all_files() file_md5 = md5(open(os.path.join(folder, docfile), "rb").read()).hexdigest() num_errors = len(errors) for attached_file in attached_files: if attached_file.checksum == file_md5: errors.append((docfile, err_desc[3])) break elif 
attached_file.get_full_name() == docfile: errors.append((docfile, err_desc[4])) break if len(errors) > num_errors: continue # Check if user has rights to upload file if req is not None: file_collection = guess_collection_of_a_record(int(rec_id)) auth_code, auth_message = acc_authorize_action(req, 'runbatchuploader', collection=file_collection) if auth_code != 0: error_msg = err_desc[5] % file_collection errors.append((docfile, error_msg)) continue # Move document to be uploaded to temporary folder (fd, tmp_file) = tempfile.mkstemp(prefix=identifier + "_" + time.strftime("%Y%m%d%H%M%S", time.localtime()) + "_", suffix=extension, dir=CFG_TMPSHAREDDIR) shutil.copy(os.path.join(folder, docfile), tmp_file) # Create MARC temporary file with FFT tag and call bibupload (fd, filename) = tempfile.mkstemp(prefix=identifier + '_', dir=CFG_TMPSHAREDDIR) filedesc = os.fdopen(fd, 'w') marc_content = """ %(rec_id)s %(name)s %(path)s """ % {'rec_id': rec_id, 'name': encode_for_xml(identifier), 'path': encode_for_xml(tmp_file), } filedesc.write(marc_content) filedesc.close() info[1].append(docfile) user = "" if req is not None: user_info = collect_user_info(req) user = user_info['nickname'] if not user: user = "batchupload" # Execute bibupload with the appropiate mode task_arguments = ('bibupload', user, "--" + mode, "--priority=" + priority, "-N", "batchupload") if exec_date: date = '--runtime=' + "\'" + exec_date + ' ' + exec_time + "\'" task_arguments += (date, ) if email_logs_to: task_arguments += ("--email-logs-to", email_logs_to) task_arguments += (filename, ) jobid = task_low_level_submission(*task_arguments) # write batch upload history run_sql("""INSERT INTO hstBATCHUPLOAD (user, submitdate, filename, execdate, id_schTASK, batch_mode) VALUES (%s, NOW(), %s, %s, %s, "document")""", (user_info['nickname'], docfile, exec_date != "" and (exec_date + ' ' + exec_time) or time.strftime("%Y-%m-%d %H:%M:%S"), str(jobid))) # Move file to DONE folder done_filename = docfile + "_" + 
time.strftime("%Y%m%d%H%M%S", time.localtime()) + "_" + str(jobid) try: os.rename(os.path.join(folder, docfile), os.path.join(files_done_dir, done_filename)) except OSError: errors.append('MoveError') return errors, info def get_user_metadata_uploads(req): """Retrieve all metadata upload history information for a given user""" user_info = collect_user_info(req) upload_list = run_sql("""SELECT DATE_FORMAT(h.submitdate, '%%Y-%%m-%%d %%H:%%i:%%S'), \ h.filename, DATE_FORMAT(h.execdate, '%%Y-%%m-%%d %%H:%%i:%%S'), \ s.status \ FROM hstBATCHUPLOAD h INNER JOIN schTASK s \ ON h.id_schTASK = s.id \ WHERE h.user=%s and h.batch_mode="metadata" ORDER BY h.submitdate DESC""", (user_info['nickname'],)) return upload_list def get_user_document_uploads(req): """Retrieve all document upload history information for a given user""" user_info = collect_user_info(req) upload_list = run_sql("""SELECT DATE_FORMAT(h.submitdate, '%%Y-%%m-%%d %%H:%%i:%%S'), \ h.filename, DATE_FORMAT(h.execdate, '%%Y-%%m-%%d %%H:%%i:%%S'), \ s.status \ FROM hstBATCHUPLOAD h INNER JOIN schTASK s \ ON h.id_schTASK = s.id \ WHERE h.user=%s and h.batch_mode="document" ORDER BY h.submitdate DESC""", (user_info['nickname'],)) return upload_list def get_daemon_doc_files(): """ Return all files found in batchuploader document folders """ files = {} for folder in ['/revise', '/append']: try: daemon_dir = CFG_BATCHUPLOADER_DAEMON_DIR[0] == '/' and CFG_BATCHUPLOADER_DAEMON_DIR \ or CFG_PREFIX + '/' + CFG_BATCHUPLOADER_DAEMON_DIR directory = daemon_dir + '/documents' + folder files[directory] = [(filename, []) for filename in os.listdir(directory) if os.path.isfile(os.path.join(directory, filename))] for file_instance, info in files[directory]: stat_info = os.lstat(os.path.join(directory, file_instance)) info.append("%s" % pwd.getpwuid(stat_info.st_uid)[0]) # Owner info.append("%s" % grp.getgrgid(stat_info.st_gid)[0]) # Group info.append("%d" % stat_info.st_size) # Size time_stat = stat_info.st_mtime time_fmt = 
"%Y-%m-%d %R" info.append(time.strftime(time_fmt, time.gmtime(time_stat))) # Last modified except OSError: pass return files def get_daemon_meta_files(): """ Return all files found in batchuploader metadata folders """ files = {} for folder in ['/correct', '/replace', '/insert', '/append']: try: daemon_dir = CFG_BATCHUPLOADER_DAEMON_DIR[0] == '/' and CFG_BATCHUPLOADER_DAEMON_DIR \ or CFG_PREFIX + '/' + CFG_BATCHUPLOADER_DAEMON_DIR directory = daemon_dir + '/metadata' + folder files[directory] = [(filename, []) for filename in os.listdir(directory) if os.path.isfile(os.path.join(directory, filename))] for file_instance, info in files[directory]: stat_info = os.lstat(os.path.join(directory, file_instance)) info.append("%s" % pwd.getpwuid(stat_info.st_uid)[0]) # Owner info.append("%s" % grp.getgrgid(stat_info.st_gid)[0]) # Group info.append("%d" % stat_info.st_size) # Size time_stat = stat_info.st_mtime time_fmt = "%Y-%m-%d %R" info.append(time.strftime(time_fmt, time.gmtime(time_stat))) # Last modified except OSError: pass return files def user_authorization(req, ln): """ Check user authorization to visit page """ auth_code, auth_message = acc_authorize_action(req, 'runbatchuploader') if auth_code != 0: referer = '/batchuploader/' return page_not_authorized(req=req, referer=referer, text=auth_message, navmenuid="batchuploader") else: return None def perform_basic_upload_checks(xml_record): """ Performs tests that would provoke the bibupload task to fail with an exit status 1, to prevent batchupload from crashing while alarming the user wabout the issue """ from invenio.bibupload import writing_rights_p errors = [] if not writing_rights_p(): errors.append("Error: BibUpload does not have rights to write fulltext files.") recs = create_records(xml_record, 1, 1) if recs == []: errors.append("Error: Cannot parse MARCXML file.") elif recs[0][0] is None: errors.append("Error: MARCXML file has wrong format: %s" % recs) return errors def perform_upload_check(xml_record, 
mode): """ Performs a upload simulation with the given record and mode @return: string describing errors @rtype: string """ error_cache = [] def my_writer(msg, stream=sys.stdout, verbose=1): if verbose == 1: if 'DONE' not in msg: error_cache.append(msg.strip()) orig_writer = bibupload_module.write_message bibupload_module.write_message = my_writer error_cache.extend(perform_basic_upload_checks(xml_record)) if error_cache: # There has been some critical error return '\n'.join(error_cache) recs = xml_marc_to_records(xml_record) try: upload_mode = mode[2:] # Adapt input data for bibupload function if upload_mode == "r insert-or-replace": upload_mode = "replace_or_insert" for record in recs: if record: record_strip_empty_volatile_subfields(record) record_strip_empty_fields(record) bibupload(record, opt_mode=upload_mode, pretend=True) finally: bibupload_module.write_message = orig_writer return '\n'.join(error_cache) def _get_useragent(req): """Return client user agent from req object.""" user_info = collect_user_info(req) return user_info['agent'] def _get_client_ip(req): """Return client IP address from req object.""" return str(req.remote_ip) def _get_client_authorized_collections(client_ip): """ Is this client permitted to use the service? Return list of collections for which the client is authorized """ ret = [] for network, collection in _CFG_BATCHUPLOADER_WEB_ROBOT_RIGHTS: if _ipmatch(client_ip, network): if '*' in collection: return ['*'] ret += collection return ret def _check_client_useragent(req): """ Is this user agent permitted to use the service? """ client_useragent = _get_useragent(req) if _CFG_BATCHUPLOADER_WEB_ROBOT_AGENTS_RE.match(client_useragent): return True return False def _check_client_can_submit_file(client_ip="", metafile="", req=None, webupload=0, ln=CFG_SITE_LANG): """ Is this client able to upload such a FILENAME? 
check 980 $a values and collection tags in the file to see if they are among the permitted ones as specified by CFG_BATCHUPLOADER_WEB_ROBOT_RIGHTS and ACC_AUTHORIZE_ACTION. Useful to make sure that the client does not override other records by mistake. """ _ = gettext_set_language(ln) recs = create_records(metafile, 0, 0) user_info = collect_user_info(req) permitted_dbcollids = _get_client_authorized_collections(client_ip) if '*' in permitted_dbcollids: return True filename_tag980_values = _detect_980_values_from_marcxml_file(recs) for filename_tag980_value in filename_tag980_values: if not filename_tag980_value: if not webupload: return False else: return(1, "Invalid collection in tag 980") if not webupload: if not filename_tag980_value in permitted_dbcollids: return False else: auth_code, auth_message = acc_authorize_action(req, 'runbatchuploader', collection=filename_tag980_value) if auth_code != 0: error_msg = _("The user '%(x_user)s' is not authorized to modify collection '%(x_coll)s'") % \ {'x_user': user_info['nickname'], 'x_coll': filename_tag980_value} return (auth_code, error_msg) filename_rec_id_collections = _detect_collections_from_marcxml_file(recs) for filename_rec_id_collection in filename_rec_id_collections: if not webupload: if not filename_rec_id_collection in permitted_dbcollids: return False else: auth_code, auth_message = acc_authorize_action(req, 'runbatchuploader', collection=filename_rec_id_collection) if auth_code != 0: error_msg = _("The user '%(x_user)s' is not authorized to modify collection '%(x_coll)s'") % \ {'x_user': user_info['nickname'], 'x_coll': filename_rec_id_collection} return (auth_code, error_msg) if not webupload: return True else: return (0, " ") def _detect_980_values_from_marcxml_file(recs): """ Read MARCXML file and return list of 980 $a values found in that file. Useful for checking rights. 
""" from invenio.bibrecord import record_get_field_values collection_tag = run_sql("SELECT value FROM tag, field_tag, field \ WHERE tag.id=field_tag.id_tag AND \ field_tag.id_field=field.id AND \ field.code='collection'") collection_tag = collection_tag[0][0] dbcollids = {} for rec, dummy1, dummy2 in recs: if rec: for tag980 in record_get_field_values(rec, tag=collection_tag[:3], ind1=collection_tag[3], ind2=collection_tag[4], code=collection_tag[5]): dbcollids[tag980] = 1 return dbcollids.keys() def _detect_collections_from_marcxml_file(recs): """ Extract all possible recIDs from MARCXML file and guess collections for these recIDs. """ from invenio.bibrecord import record_get_field_values from invenio.search_engine import guess_collection_of_a_record from invenio.bibupload import find_record_from_sysno, \ find_records_from_extoaiid, \ find_record_from_oaiid dbcollids = {} sysno_tag = CFG_BIBUPLOAD_EXTERNAL_SYSNO_TAG oaiid_tag = CFG_BIBUPLOAD_EXTERNAL_OAIID_TAG oai_tag = CFG_OAI_ID_FIELD for rec, dummy1, dummy2 in recs: if rec: for tag001 in record_get_field_values(rec, '001'): collection = guess_collection_of_a_record(int(tag001)) dbcollids[collection] = 1 for tag_sysno in record_get_field_values(rec, tag=sysno_tag[:3], ind1=sysno_tag[3], ind2=sysno_tag[4], code=sysno_tag[5]): record = find_record_from_sysno(tag_sysno) if record: collection = guess_collection_of_a_record(int(record)) dbcollids[collection] = 1 for tag_oaiid in record_get_field_values(rec, tag=oaiid_tag[:3], ind1=oaiid_tag[3], ind2=oaiid_tag[4], code=oaiid_tag[5]): try: records = find_records_from_extoaiid(tag_oaiid) except Error: records = [] if records: record = records.pop() collection = guess_collection_of_a_record(int(record)) dbcollids[collection] = 1 for tag_oai in record_get_field_values(rec, tag=oai_tag[0:3], ind1=oai_tag[3], ind2=oai_tag[4], code=oai_tag[5]): record = find_record_from_oaiid(tag_oai) if record: collection = guess_collection_of_a_record(int(record)) dbcollids[collection] = 1 
return dbcollids.keys() def _transform_input_to_marcxml(file_input=""): """ Takes text-marc as input and transforms it to MARCXML. """ # Create temporary file to read from tmp_fd, filename = tempfile.mkstemp(dir=CFG_TMPSHAREDDIR) os.write(tmp_fd, file_input) os.close(tmp_fd) try: # Redirect output, transform, restore old references old_stdout = sys.stdout new_stdout = StringIO() sys.stdout = new_stdout transform_file(filename) finally: sys.stdout = old_stdout return new_stdout.getvalue() def _log(msg, logfile="webupload.log"): """ Log MSG into LOGFILE with timestamp. """ filedesc = open(CFG_LOGDIR + "/" + logfile, "a") filedesc.write(time.strftime("%Y-%m-%d %H:%M:%S") + " --> " + msg + "\n") filedesc.close() return def _write(req, msg): """ Write MSG to the output stream for the end user. """ req.write(msg + "\n") return diff --git a/modules/bibupload/lib/batchuploader_webinterface.py b/modules/bibupload/lib/batchuploader_webinterface.py index a6f0720ba..df58e7a67 100644 --- a/modules/bibupload/lib/batchuploader_webinterface.py +++ b/modules/bibupload/lib/batchuploader_webinterface.py @@ -1,350 +1,350 @@ # -*- coding: utf-8 -*- ## ## This file is part of Invenio. ## Copyright (C) 2010, 2011, 2013 CERN. ## ## Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. 
"""WebUpload web interface""" __revision__ = "$Id$" __lastupdated__ = """$Date$""" from invenio.webinterface_handler_wsgi_utils import Field from invenio.config import CFG_SITE_SECURE_URL from invenio.urlutils import redirect_to_url from invenio.messages import gettext_set_language from invenio.webinterface_handler import wash_urlargd, WebInterfaceDirectory from invenio.webinterface_handler_config import SERVER_RETURN, HTTP_NOT_FOUND from invenio.webinterface_handler_wsgi_utils import handle_file_post from invenio.webuser import getUid, page_not_authorized, get_email from invenio.webpage import page from invenio.batchuploader_engine import metadata_upload, cli_upload, \ get_user_metadata_uploads, get_user_document_uploads, document_upload, \ get_daemon_doc_files, get_daemon_meta_files, cli_allocate_record, \ user_authorization, perform_upload_check, _transform_input_to_marcxml try: import invenio.template batchuploader_templates = invenio.template.load('batchuploader') except: pass class WebInterfaceBatchUploaderPages(WebInterfaceDirectory): """Defines the set of /batchuploader pages.""" _exports = ['', 'metadata', 'metasubmit', 'history', 'documents', 'docsubmit', 'daemon', 'allocaterecord', 'confirm'] def _lookup(self, component, path): def restupload(req, form): """Interface for robots used like this: $ curl --data-binary '@localfile.xml' http://cds.cern.ch/batchuploader/robotupload/[insert|replace|correct|append]?[callback_url=http://...]&nonce=1234 -A invenio_webupload """ filepath, mimetype = handle_file_post(req) argd = wash_urlargd(form, {'callback_url': (str, None), 'nonce': (str, None), 'special_treatment': (str, None)}) return cli_upload(req, open(filepath), '--' + path[0], argd['callback_url'], argd['nonce'], argd['special_treatment']) def legacyrobotupload(req, form): """Interface for robots used like this: $ curl -F 'file=@localfile.xml' -F 'mode=-i' [-F 'callback_url=http://...'] [-F 'nonce=1234'] http://cds.cern.ch/batchuploader/robotupload -A 
invenio_webupload """ argd = wash_urlargd(form, {'mode': (str, None), 'callback_url': (str, None), 'nonce': (str, None), 'special_treatment': (str, None)}) return cli_upload(req, form.get('file', None), argd['mode'], argd['callback_url'], argd['nonce'], argd['special_treatment']) if component == 'robotupload': - if path and path[0] in ('insert', 'replace', 'correct', 'append'): + if path and path[0] in ('insert', 'replace', 'correct', 'append', 'insertorreplace'): return restupload, None else: return legacyrobotupload, None else: return None, path def index(self, req, form): """ The function called by default """ redirect_to_url(req, "%s/batchuploader/metadata" % (CFG_SITE_SECURE_URL)) def metadata(self, req, form): """ Display Metadata file upload form """ argd = wash_urlargd(form, { 'filetype': (str, ""), 'mode': (str, ""), 'submit_date': (str, "yyyy-mm-dd"), 'submit_time': (str, "hh:mm:ss"), 'email_logs_to': (str, None)}) _ = gettext_set_language(argd['ln']) not_authorized = user_authorization(req, argd['ln']) if not_authorized: return not_authorized uid = getUid(req) if argd['email_logs_to'] is None: argd['email_logs_to'] = get_email(uid) body = batchuploader_templates.tmpl_display_menu(argd['ln'], ref="metadata") body += batchuploader_templates.tmpl_display_web_metaupload_form(argd['ln'], argd['filetype'], argd['mode'], argd['submit_date'], argd['submit_time'], argd['email_logs_to']) title = _("Metadata batch upload") return page(title = title, body = body, metaheaderadd = batchuploader_templates.tmpl_styles(), uid = uid, lastupdated = __lastupdated__, req = req, language = argd['ln'], navmenuid = "batchuploader") def documents(self, req, form): """ Display document upload form """ argd = wash_urlargd(form, { }) _ = gettext_set_language(argd['ln']) not_authorized = user_authorization(req, argd['ln']) if not_authorized: return not_authorized uid = getUid(req) email_logs_to = get_email(uid) body = batchuploader_templates.tmpl_display_menu(argd['ln'], 
ref="documents") body += batchuploader_templates.tmpl_display_web_docupload_form(argd['ln'], email_logs_to=email_logs_to) title = _("Document batch upload") return page(title = title, body = body, metaheaderadd = batchuploader_templates.tmpl_styles(), uid = uid, lastupdated = __lastupdated__, req = req, language = argd['ln'], navmenuid = "batchuploader") def docsubmit(self, req, form): """ Function called after submitting the document upload form. Performs the appropiate action depending on the input parameters """ argd = wash_urlargd(form, {'docfolder': (str, ""), 'matching': (str, ""), 'mode': (str, ""), 'submit_date': (str, ""), 'submit_time': (str, ""), 'priority': (str, ""), 'email_logs_to': (str, "")}) _ = gettext_set_language(argd['ln']) not_authorized = user_authorization(req, argd['ln']) if not_authorized: return not_authorized date = argd['submit_date'] not in ['yyyy-mm-dd', ''] \ and argd['submit_date'] or '' time = argd['submit_time'] not in ['hh:mm:ss', ''] \ and argd['submit_time'] or '' errors, info = document_upload(req, argd['docfolder'], argd['matching'], argd['mode'], date, time, argd['ln'], argd['priority'], argd['email_logs_to']) body = batchuploader_templates.tmpl_display_menu(argd['ln']) uid = getUid(req) navtrail = '''%s''' % \ (CFG_SITE_SECURE_URL, _("Document batch upload")) body += batchuploader_templates.tmpl_display_web_docupload_result(argd['ln'], errors, info) title = _("Document batch upload result") return page(title = title, body = body, metaheaderadd = batchuploader_templates.tmpl_styles(), uid = uid, navtrail = navtrail, lastupdated = __lastupdated__, req = req, language = argd['ln'], navmenuid = "batchuploader") def allocaterecord(self, req, form): """ Interface for robots to allocate a record and obtain a record identifier """ return cli_allocate_record(req) def metasubmit(self, req, form): """ Function called after submitting the metadata upload form. Checks if input fields are correct before uploading. 
""" argd = wash_urlargd(form, {'metafile': (str, None), 'filetype': (str, None), 'mode': (str, None), 'submit_date': (str, None), 'submit_time': (str, None), 'filename': (str, None), 'priority': (str, None), 'email_logs_to': (str, None)}) _ = gettext_set_language(argd['ln']) # Check if the page is directly accessed if argd['metafile'] == None: redirect_to_url(req, "%s/batchuploader/metadata" % (CFG_SITE_SECURE_URL)) not_authorized = user_authorization(req, argd['ln']) if not_authorized: return not_authorized date = argd['submit_date'] not in ['yyyy-mm-dd', ''] \ and argd['submit_date'] or '' time = argd['submit_time'] not in ['hh:mm:ss', ''] \ and argd['submit_time'] or '' auth_code, auth_message = metadata_upload(req, argd['metafile'], argd['filetype'], argd['mode'].split()[0], date, time, argd['filename'], argd['ln'], argd['priority'], argd['email_logs_to']) if auth_code == 1: # not authorized referer = '/batchuploader/' return page_not_authorized(req=req, referer=referer, text=auth_message, navmenuid="batchuploader") else: uid = getUid(req) body = batchuploader_templates.tmpl_display_menu(argd['ln']) body += batchuploader_templates.tmpl_upload_successful(argd['ln']) title = _("Upload successful") navtrail = '''%s''' % \ (CFG_SITE_SECURE_URL, _("Metadata batch upload")) return page(title = title, body = body, uid = uid, navtrail = navtrail, lastupdated = __lastupdated__, req = req, language = argd['ln'], navmenuid = "batchuploader") def confirm(self, req, form): """ Function called after submitting the metadata upload form. 
             Shows a summary of actions to be performed and possible errors
         """
         argd = wash_urlargd(form, {'metafile': (Field, None),
                                    'filetype': (str, None),
                                    'mode': (str, None),
                                    'submit_date': (str, None),
                                    'submit_time': (str, None),
                                    'filename': (str, None),
                                    'priority': (str, None),
                                    'skip_simulation': (str, None),
                                    'email_logs_to': (str, None)})
         _ = gettext_set_language(argd['ln'])

         # Check if the page is directly accessed or no file selected
         if not argd['metafile']:
             redirect_to_url(req, "%s/batchuploader/metadata"
                             % (CFG_SITE_SECURE_URL))

         metafile = argd['metafile'].value
         if argd['filetype'] != 'marcxml':
             metafile = _transform_input_to_marcxml(file_input=metafile)

         date = argd['submit_date'] not in ['yyyy-mm-dd', ''] \
                 and argd['submit_date'] or ''
         time = argd['submit_time'] not in ['hh:mm:ss', ''] \
                 and argd['submit_time'] or ''

         errors_upload = ''

         skip_simulation = argd['skip_simulation'] == "skip"
         if not skip_simulation:
             errors_upload = perform_upload_check(metafile, argd['mode'])

         body = batchuploader_templates.tmpl_display_confirm_page(argd['ln'],
                         metafile, argd['filetype'], argd['mode'], date,
                         time, argd['filename'], argd['priority'], errors_upload,
                         skip_simulation, argd['email_logs_to'])

         uid = getUid(req)
         navtrail = '''%s''' % \
                     (CFG_SITE_SECURE_URL, _("Metadata batch upload"))
         title = 'Confirm your actions'
         return page(title = title,
                     body = body,
                     metaheaderadd = batchuploader_templates.tmpl_styles(),
                     uid = uid,
                     navtrail = navtrail,
                     lastupdated = __lastupdated__,
                     req = req,
                     language = argd['ln'],
                     navmenuid = "batchuploader")

     def history(self, req, form):
         """Display upload history of the current user"""
         argd = wash_urlargd(form, {})
         _ = gettext_set_language(argd['ln'])

         not_authorized = user_authorization(req, argd['ln'])
         if not_authorized:
             return not_authorized

         uploaded_meta_files = get_user_metadata_uploads(req)
         uploaded_doc_files = get_user_document_uploads(req)

         uid = getUid(req)
         body = batchuploader_templates.tmpl_display_menu(argd['ln'], ref="history")
         body += batchuploader_templates.tmpl_upload_history(argd['ln'],
                         uploaded_meta_files, uploaded_doc_files)
         title = _("Upload history")
         return page(title = title,
                     body = body,
                     metaheaderadd = batchuploader_templates.tmpl_styles(),
                     uid = uid,
                     lastupdated = __lastupdated__,
                     req = req,
                     language = argd['ln'],
                     navmenuid = "batchuploader")

     def daemon(self, req, form):
         """ Display content of folders where the daemon will look into """
         argd = wash_urlargd(form, {})
         _ = gettext_set_language(argd['ln'])

         not_authorized = user_authorization(req, argd['ln'])
         if not_authorized:
             return not_authorized

         docs = get_daemon_doc_files()
         metadata = get_daemon_meta_files()

         uid = getUid(req)
         body = batchuploader_templates.tmpl_display_menu(argd['ln'], ref="daemon")
         body += batchuploader_templates.tmpl_daemon_content(argd['ln'], docs, metadata)
         title = _("Batch Uploader: Daemon monitor")
         return page(title = title,
                     body = body,
                     metaheaderadd = batchuploader_templates.tmpl_styles(),
                     uid = uid,
                     lastupdated = __lastupdated__,
                     req = req,
                     language = argd['ln'],
                     navmenuid = "batchuploader")

     def __call__(self, req, form):
         """Redirect calls without final slash."""
         redirect_to_url(req, '%s/batchuploader/metadata' % CFG_SITE_SECURE_URL)
diff --git a/modules/webstyle/lib/invenio.wsgi b/modules/webstyle/lib/invenio.wsgi
index 4785245f2..c8190641a 100644
--- a/modules/webstyle/lib/invenio.wsgi
+++ b/modules/webstyle/lib/invenio.wsgi
@@ -1,67 +1,67 @@
 # -*- coding: utf-8 -*-
 ## This file is part of Invenio.
 ## Copyright (C) 2009, 2010, 2011, 2012 CERN.
 ##
 ## Invenio is free software; you can redistribute it and/or
 ## modify it under the terms of the GNU General Public License as
 ## published by the Free Software Foundation; either version 2 of the
 ## License, or (at your option) any later version.
 ##
 ## Invenio is distributed in the hope that it will be useful, but
 ## WITHOUT ANY WARRANTY; without even the implied warranty of
 ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 ## General Public License for more details.
 ##
 ## You should have received a copy of the GNU General Public License
 ## along with Invenio; if not, write to the Free Software Foundation, Inc.,
 ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.

 """
 mod_wsgi Invenio application loader.
 """

 # start remote debugger if appropriate:
 from invenio.config import CFG_DEVEL_SITE
 if CFG_DEVEL_SITE:
     try:
         from invenio import remote_debugger
         remote_debugger.start_file_changes_monitor()
     except:
         pass

 # wrap warnings (usually from sql queries) to log the traceback
 # of their origin for debugging
 try:
     from invenio.errorlib import wrap_warn
     wrap_warn()
 except:
     pass

 # pre-load citation dictionaries upon WSGI application start-up (the
 # citation dictionaries are loaded lazily, which is good for CLI
 # processes such as bibsched, but for web user queries we want them to
 # be available right after web server start-up):
-from invenio.bibrank_citation_searcher import get_cited_by_weight
-get_cited_by_weight([])
+#from invenio.bibrank_citation_searcher import get_cited_by_weight
+#get_cited_by_weight([])

 # pre-load docextract knowledge bases
-from invenio.refextract_kbs import get_kbs
-get_kbs()
+#from invenio.refextract_kbs import get_kbs
+#get_kbs()

 # pre-load docextract author regexp
-from invenio.authorextract_re import get_author_regexps
-get_author_regexps()
+#from invenio.authorextract_re import get_author_regexps
+#get_author_regexps()

 # increase compile regexps cache size for further
 # speed improvements in docextract
 import re
 re._MAXCACHE = 2000

 try:
     from invenio.webinterface_handler_wsgi import application
 finally:
     ## mod_wsgi uses one thread to import the .wsgi file
     ## and a second one to instantiate the application.
     ## Therefore we need to close redundant connections that
     ## are allocated on the 1st thread.
     from invenio.dbquery import close_connection
     close_connection()