refextract: improve affiliated author search

Description

  • Include delimiters when arranging affiliated authors.
  • Preserve the realigned numeration when searching for authors.
  • Reuse the 'around-comma' numeration swapping when looking for affiliated authors.
  • Add another config variable for the replacement of affiliation terms. Rename the original affiliation config variable to include the word 'reduction'.
  • Improve the regular expressions that obtain numeration; only match numeration on lines which hold other content too.
  • Collect numerated affiliation data together when searching.
  • Show the list of affiliated authors per affiliation when searching for affiliated authors. This is controlled by the CLI verbosity option.
  • Change the flag associated with the extraction of affiliations from -f to -l. This avoids a clash with the forthcoming fulltext API change to refextract, which uses -f, --fulltext for providing fulltext input.
  • Fix the mechanism of adding to the list of affiliated author info: only append a new affiliated author item if authors actually exist for that item. This prevents a set of affiliated authors from being wrongly selected over a set of standard authors when no actual authors exist, just affiliation/strength data.
  • Add cli verbosity-controlled messages, depicting the current status of the author extraction process.
  • Repair the cli arguments used inside get_cli_opts.
  • Change the document information returned by extract_top_document_information_from_fulltext. It now returns a list of dictionaries containing author data with possible affiliations, and a list of affiliation data.
    • This excludes a list of 'marked-up' author data, which is now assembled outside of this function call.
  • Move the locating of a document's reference section into the functions concerned with either extracting references or authors/affiliations.
  • Rename variables relating to lines holding either reference or top-section data, away from reference specific names.
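
The fix to the affiliated author list described above (append an item only when it actually holds authors) can be sketched roughly as follows. This is an illustration, not the refextract code: the function names `collect_affiliated_authors` and `choose_author_set`, and the dictionary keys `authors` and `affiliations`, are assumptions made for the example.

```python
def collect_affiliated_authors(groups):
    """Keep only the groups that name at least one author.

    `groups` is a hypothetical list of dicts, each with an 'authors'
    list and an 'affiliations' list (names assumed for this sketch).
    """
    affiliated = []
    for group in groups:
        # Drop empty or whitespace-only author strings.
        authors = [a.strip() for a in group.get("authors", []) if a.strip()]
        if not authors:
            # Affiliation data without any authors: do not append,
            # otherwise an author-less item could make the affiliated
            # set look non-empty.
            continue
        affiliated.append({
            "authors": authors,
            "affiliations": list(group.get("affiliations", [])),
        })
    return affiliated


def choose_author_set(affiliated, standard):
    """Prefer the affiliated authors only when some were actually found."""
    return affiliated if affiliated else standard
```

With this guard, a group that carries only affiliation data contributes nothing, so the standard author set is still chosen when no affiliated authors were found.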

Details

Committed
Tibor Simko <tibor.simko@cern.ch>, Nov 23 2011, 00:34
Parents
R3600:f12b25aa95c4: refextract: improve realign numeration
Branches
Unknown
Tags
Unknown

Event Timeline

Tibor Simko <tibor.simko@cern.ch> committed R3600:49989b534236: refextract: improve affiliated author search (authored by Christopher Hayward <christopher.james.hayward@cern.ch>). Nov 23 2011, 00:34