
OAIHarvest: remove_duplicates and regexp fixes

Authored by Alexander Wagner <alexander.wagner@desy.de> on Dec 3 2014, 11:24.

Description

OAIHarvest: remove_duplicates and regexp fixes

  • The getter used a regular expression built from the string of the 'verb' parameter. The 'verb' in question here is 'ListRecords', which also names the element that opens the data block. However, XML allows explicit namespace prefixes, so this match fails when we get qualified XML.
  • According to the OAI-PMH specification, the server should return a 'noRecordsMatch' error when nothing matches (which is effectively the reason for the missing ListRecords stanza). Checking for the existence of this error instead therefore gives a valid check.
  • Additionally, the "parsing" of the resumptionToken also assumed that we never get any namespace. This is hotfixed by a more general regexp that allows for the inclusion of namespaces, but ultimately one should consider real XML parsing. (cf. FIXME)
  • remove_duplicates() relied on regexping the results to retrieve duplicate OAI keys from a source and remove the duplicate records. This, however, does not work with qualified XML containing namespaces; the old method will just corrupt the XML output. Thus introduce XML parsing by means of lxml to remove the dupes.
  • Adds try/except for loading lxml, as lxml is not a requirement in maint-1.1.
  • This patch was introduced to cope with OpenAIRE project set. (closes #2300) (PR #2608)
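
The points above can be illustrated with a minimal Python 3 sketch. It is not the code from this patch: the regular expressions, the helper names has_no_records() and get_resumption_token(), the OAI_NS constant, and the fallback to the standard-library ElementTree when lxml is missing are all assumptions made for illustration. It also assumes the harvested response is available as a text string that can be re-encoded as UTF-8.

```python
import re

try:
    from lxml import etree  # preferred parser, but an optional dependency
except ImportError:
    # Assumption for this sketch: fall back to the stdlib parser,
    # which offers the subset of the API used below.
    import xml.etree.ElementTree as etree

OAI_NS = "http://www.openarchives.org/OAI/2.0/"

# Both patterns accept an optional namespace prefix such as "oai:".
RESUMPTION_TOKEN_RE = re.compile(
    r"<(?:[\w.-]+:)?resumptionToken\b[^>]*>([^<]*)"
    r"</(?:[\w.-]+:)?resumptionToken\s*>"
)
NO_RECORDS_RE = re.compile(
    r'<(?:[\w.-]+:)?error\b[^>]*\bcode\s*=\s*["\']noRecordsMatch["\']'
)


def has_no_records(response):
    """Return True if the response carries a 'noRecordsMatch' error."""
    return NO_RECORDS_RE.search(response) is not None


def get_resumption_token(response):
    """Return the resumption token, or None if absent or empty."""
    match = RESUMPTION_TOKEN_RE.search(response)
    if match is None:
        return None
    return match.group(1).strip() or None


def remove_duplicates(xml_text):
    """Drop records whose header identifier was already seen."""
    root = etree.fromstring(xml_text.encode("utf-8"))
    record_tag = "{%s}record" % OAI_NS
    id_path = "{%s}header/{%s}identifier" % (OAI_NS, OAI_NS)
    seen = set()
    for container in root.iter("{%s}ListRecords" % OAI_NS):
        # Copy the list first, since records are removed while looping.
        for record in list(container.findall(record_tag)):
            identifier = record.findtext(id_path)
            if identifier is None:
                continue
            if identifier in seen:
                container.remove(record)
            else:
                seen.add(identifier)
    return etree.tostring(root, encoding="unicode")
```

Because remove_duplicates() matches elements by namespace URI rather than by the literal tag text, it behaves the same whether the server writes <record> or <oai:record>. The regular expressions remain a stopgap in the same spirit as the hotfix described above.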

Tested-by: Tibor Simko <tibor.simko@cern.ch>

Details

Committed
Tibor Simko <tibor.simko@cern.ch> on Jan 26 2015, 14:32
Parents
R3600:bdf145891b5f: installation: explicit jQuery plugin versions
Branches
Unknown
Tags
Unknown

Event Timeline

Tibor Simko <tibor.simko@cern.ch> committed R3600:3d60603e08d0: OAIHarvest: remove_duplicates and regexp fixes (authored by Alexander Wagner <alexander.wagner@desy.de>) on Jan 26 2015, 14:32.