BibHarvest: reimplementation of OAI repository

Authored by Samuele Kaplun <samuele.kaplun@cern.ch> on Jul 8 2011, 15:20.

Description

  • New CFG_OAI_PREVIOUS_SET_FIELD parameter in invenio.conf to store which set_spec a record was a member of in the past (suitable for discovering when records move out of a set).
  • CFG_OAI_DELETED_POLICY is now set by default to 'persistent' since the support for OAI-PMH deleted policy is now fully implemented. (closes #778)
  • CFG_OAI_SAMPLE_IDENTIFIER is now correctly set to a sensible default.
  • Simplified CFG_OAI_IDENTIFY_DESCRIPTION configuration by generating the OAI-PMH verb=Identify response in a more automatic way.
  • CFG_OAI_SLEEP is now set by default to 2, since the performance of the OAI-PMH handler has improved in several directions.
  • New CFG_OAI_METADATA_FORMATS parameter to configure which OAI-PMH metadataPrefixes are understood. OAI-PMH record representation is now integrated into BibFormat: OAI MARCXML is generated via the XOAIMARC.bfo output_format, which by default uses the OAI_MARC.bft format_template. (closes #426)
  • New CFG_OAI_PROVENANCE_* parameters to support exporting provenance information for records harvested from other repositories. Updated the default demo *2marcxml.xsl stylesheets used when harvesting (removed the partially broken usage of the 909CO field for provenance information). (closes #122)
  • When CFG_SITE_DEVEL is set to True, OAI-PMH responses are automatically validated against the provided OAI-PMH.xsd schema, in order to assert their correctness.
  • Greatly improved speed of oai_repository_server code, partially refactored to improve kwalitee. Improved error handling, resumptionToken manipulation, buffering, caching, completeness of responses...
  • Greatly improved speed of oai_repository_updater code, fully rewriting the main task, by using intbitset and by considering all and only the records that really need to be updated. (closes #300)
  • BibUpload now supports an altered flag (as configured by CFG_OAI_PROVENANCE_ALTERED_SUBFIELD), used by the above-mentioned new CFG_OAI_PROVENANCE_* feature.
  • New handy tools in htmlutils to generate HTML or XML in a Pythonic way.
  • New get_all_field_values function in search_engine to quickly obtain all existing values for a certain field.
  • As a side modification, removed HitSet aliases to intbitset everywhere.
  • Records with no OAI IDs are not exported. Records with more than one OAI ID are exported, but only the first OAI ID found is used, and an alert is raised to the admin.
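
Two of the behaviours listed above can be illustrated with a minimal Python sketch. It is not the code from this commit: the function names plan_set_update and pick_oai_id are hypothetical, built-in sets stand in for intbitset, and the standard logging module stands in for whatever mechanism alerts the admin. The first function shows the "all and only the records that really need to be updated" idea as a pair of set differences; the second applies the export rule for records with zero, one, or several OAI IDs.

```python
import logging

# Hypothetical logger; the real alerting mechanism is not shown in the commit message.
logger = logging.getLogger("oai_sketch")


def plan_set_update(matching_now, tagged_before):
    """Return (to_tag, to_untag) for a single OAI set.

    matching_now  -- IDs of records that satisfy the set definition today
    tagged_before -- IDs of records that already carry the set_spec
    Built-in sets are used here as a stand-in for intbitset.
    """
    to_tag = matching_now - tagged_before      # records that newly entered the set
    to_untag = tagged_before - matching_now    # records that left the set
    return to_tag, to_untag


def pick_oai_id(recid, oai_ids):
    """Apply the export rule: no ID means skip, several IDs means use the first and warn."""
    if not oai_ids:
        return None  # record is not exported
    if len(oai_ids) > 1:
        logger.warning("Record %s has %d OAI IDs; using %s",
                       recid, len(oai_ids), oai_ids[0])
    return oai_ids[0]


to_tag, to_untag = plan_set_update({1, 2, 3, 5}, {2, 3, 4})
print(sorted(to_tag), sorted(to_untag))
print(pick_oai_id(10, []))
print(pick_oai_id(11, ["oai:example.org:11", "oai:example.org:99"]))
```

In this example only records 1, 5 and 4 would be touched; records 2 and 3 are left alone. On one reading of the first bullet, records such as 4 are the ones whose old set_spec would be kept in the field named by CFG_OAI_PREVIOUS_SET_FIELD, but that is an assumption, not something the commit message spells out.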

Event Timeline

Tibor Simko <tibor.simko@cern.ch> committed R3600:ee9a940e2546: BibHarvest: reimplementation of OAI repository (authored by Samuele Kaplun <samuele.kaplun@cern.ch>). Nov 23 2011, 18:24