RefExtract: improves docextract, refextract

Authored by Alessio Deiana <alessio.deiana@cern.ch> on Oct 12 2012, 15:15.

Description

RefExtract: improves docextract, refextract

  • Bumps version to 1.5.32
  • Adds "0" subfield that associates recids to refs.
  • Improves arxiv rn parsing.
  • More permissive url, authors and doi regexps.
  • Reworks pagination removal.
  • Reworks journals identification.
  • Reworks references splitting.
  • Reworks refs section detection.
  • Handles authors with weird characters.
  • Fixes -c option.
  • Prevents having 2 "a" (author) subfields.
  • Improves handling of PoS.
  • Removes "xbook" subfield.
  • Adds "c" (collaboration) subfield.
  • Only accept 1 as first ref.
  • Only output o subfield if marker is present.
  • Tweaks find numeration.
  • Only creates tickets for records that are less than 2 years old.
  • Fixes marker detection.
  • Allows : and - in line markers.
  • Fixes find rawref: the tag number for converting to the journal format was hardcoded. This is no longer the case.
  • Handles ATL-CONF: handles several formats for ATLAS conferences used in references (ATL-CONF-99, ATLAS-CONF-2010-001, ...).
  • Fixes missing lines when a line marker with a number that does not match the sequence (1, 2, 3, etc.) is detected.
  • We do not strip pagination anymore. I am keeping this test in case we re-enable it.
  • Fixes bug that would remove letter from o subfield.
  • arXiv report numbers with a category are not found by the search engine, so we remove the category before searching.
  • Adds a script to convert journal abbreviations to short form.
  • Adds a new record model, similar to bibrecord but easier to use.
  • Switches to the new bibrecord model to generate xml. This way we can generate it more easily and we can ensure it will always be valid.
  • Encoding tweak: functions that provide a serialization format (namely XML) now return bytes instead of unicode.
  • If a reference has 2 different DOIs, we split it. We also remove duplicate DOI tags.
  • Prevents authors with "paper" as name.
  • Handles 1.1 numeration.
  • When unidecode produced a string with a different length than the original, the replace would be off, overwriting text to the right. We now pass the output through unidecode in that case.
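
The unidecode item above can be illustrated with a minimal sketch. It is not the refextract code: to_ascii is a hypothetical stand-in for the unidecode package (with a tiny made-up table), and tag_year is an invented example of matching on ASCII text and then applying the match offsets. Offsets are only applied to the original line when both strings have the same length; otherwise the ASCII version is used as output.

```python
import re

# Hypothetical stand-in for the third-party unidecode package:
# a tiny transliteration table, for illustration only.
TRANSLIT = {"ß": "ss", "ü": "u", "é": "e", "æ": "ae"}


def to_ascii(text):
    """Transliterate text character by character (the length may change)."""
    return "".join(TRANSLIT.get(ch, ch) for ch in text)


def tag_span(text, start, end, tag):
    """Wrap text[start:end] in <tag>...</tag>."""
    return "%s<%s>%s</%s>%s" % (text[:start], tag, text[start:end], tag, text[end:])


def tag_year(line):
    """Find a year in the ASCII form of the line and tag it (invented example)."""
    ascii_line = to_ascii(line)
    match = re.search(r"\b(?:19|20)\d{2}\b", ascii_line)
    if match is None:
        return line
    if len(ascii_line) == len(line):
        # Offsets line up, so the original characters can be kept.
        target = line
    else:
        # Offsets are shifted; applying them to the original would
        # overwrite text to the right, so fall back to the ASCII text.
        target = ascii_line
    return tag_span(target, match.start(), match.end(), "year")


if __name__ == "__main__":
    print(tag_year("Müller, 2012"))
    print(tag_year("Weiß, 2012"))
```

In the second call the transliteration is one character longer than the input, so the sketch returns the ASCII line with the tag rather than applying shifted offsets to the original.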

Signed-off-by: Alessio Deiana <alessio.deiana@cern.ch>

Details

Committed
Samuele Kaplun <samuele.kaplun@cern.ch>, Dec 18 2013, 16:21
Parents
R3600:9f5b2195c462: HepData: new HepData module
Branches
Unknown
Tags
Unknown

Event Timeline

Samuele Kaplun <samuele.kaplun@cern.ch> committed R3600:1b7b4c571ed3: RefExtract: improves docextract, refextract (authored by Alessio Deiana <alessio.deiana@cern.ch>). Dec 18 2013, 16:21