BibMatch: match validation

Authored by Jan Aage Lavik <jan.age.lavik@cern.ch> on Sep 22 2011, 16:00.

Description

  • Adds a new sub-module, the match validation step, which compares records after the search for potentially matching records. (fixes #548)
    • Various methods are used when comparing records, for example special metrics for comparing authors, titles and identifiers.

      These comparison methods are configurable per (sub-)field and act as rules for matching records. The rules can be grouped into rulesets selected by regular expressions, allowing records to be compared differently based on their content. (fixes #183)
    • For an exact match, all defined comparison rules must succeed. If they do not all succeed, but the ratio of successful rules is above a configurable limit, the match is considered fuzzy. Two or more matching fields MUST be found, unless certain MARC fields have been configured as 'final' or 'joker' types, for example identifier fields such as DOI or ISBN.
    • Adds a configuration option that limits the maximum number of search results compared for a single search query.
  • Match validation and fuzzy searching can be disabled with the CLI options '--no-valid' and '--no-fuzzy' respectively.
  • Adds a new CLI option, '--ascii', which transliterates record values to ASCII before they are used in searching and matching. XML entities, such as &amp;, are converted to UTF-8 characters before searches.
  • Adds a configuration module specific to BibMatch internal globals.
  • Enables automatic logging of BibMatch runs, providing information about record matching results.
  • Also adds applicable regression tests, a new unit-test module and brand new admin and hacking guides.
  • Detects if any input records are badly parsed by BibRecord.
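
The exact/fuzzy decision described above can be sketched roughly as follows. This is a minimal illustration, not the code added by this commit: the names `RuleResult`, `classify_match` and `FUZZY_THRESHOLD`, the threshold value, and the reading that a succeeding 'final' field decides the match on its own are all assumptions.

```python
from dataclasses import dataclass

# Hypothetical names and values throughout; this is not BibMatch's actual API.
FUZZY_THRESHOLD = 0.65  # assumed value for the configurable limit


@dataclass
class RuleResult:
    field: str           # MARC (sub-)field the rule was applied to
    matched: bool        # whether the comparison rule succeeded
    final: bool = False  # assumed flag for identifier-like fields (DOI, ISBN)


def classify_match(results, fuzzy_threshold=FUZZY_THRESHOLD, min_fields=2):
    """Return 'exact', 'fuzzy' or 'none' for one candidate record."""
    if not results:
        return "none"
    # Assumption: a succeeding 'final' rule is sufficient by itself.
    if any(r.final and r.matched for r in results):
        return "exact"
    matched = sum(1 for r in results if r.matched)
    # At least two matching fields are required otherwise.
    if matched < min_fields:
        return "none"
    # All rules succeeded: exact match.
    if matched == len(results):
        return "exact"
    # Some rules failed, but the success ratio is above the limit: fuzzy.
    if matched / len(results) > fuzzy_threshold:
        return "fuzzy"
    return "none"
```

For instance, under these assumptions three successful rules out of four (ratio 0.75) yield a fuzzy match, while a single matching DOI rule marked as final yields an exact match.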

Details

Committed
Tibor Simko <tibor.simko@cern.ch> on Jan 17 2012, 15:38
Parents
R3600:91da4331062f: bibrecord: improve handling of record parser

Event Timeline

Tibor Simko <tibor.simko@cern.ch> committed R3600:d39a330c306e: BibMatch: match validation (authored by Jan Aage Lavik <jan.age.lavik@cern.ch>) on Jan 17 2012, 15:38.