DocExtract: multiple fixes

Authored by Alessio Deiana <alessio.deiana@cern.ch> on Jun 13 2012, 13:12.

Description

  • Fixes for --raw-references for refextract.
  • Handles extra vars for task_run_core.
  • Adds missing import.
  • Fixes arXiv records selection.
  • Fixes storing last run date for arXiv records.
  • Handles PoS(LAT2005)239.
  • Removes B from volume for nucl.phys.proc.suppl.
  • Adds math-ph to arXiv prefixes.
  • Never re-extract curated records.
  • Does not default to page 1 for special journals.
  • Converts the reference to misc text when the page looks like a year (19xx or 20xx).
  • Removes citation splitting heuristics by authors.
  • New way of splitting references.
  • Extracts year from references.
  • Creates a subfield "y" with the year when it is known.
  • Reworks arXiv report numbers processing.
  • Adds CMS report numbers.
  • Adds CERN-2004-003 to recognized reports.
  • Does not default to page 1 when looking for numeration.
  • Removes use of all and any (for Python 2.4 compatibility).
  • Removes extra write_message.
  • Handles Phys.Lett. 100B (1981), 117.
  • Discards references with many lines.
  • Handles refs with only a report number.
  • Fixes [6] ATL-PHYS-INT-2009-110 reportnumber detection.
  • Adds test for recognizing a journal/reportnumber/doi alone.
  • Bumps refextract version.
  • Fixes references splitting.
  • Fixes refs extraction from string.
  • Handles refs that do not start at the beginning of a line.
  • Increases allowed refs len to 9 lines.
  • Fixes report numbers replacing.
  • Fixes report number kb format.
  • Increases max number of lines for a reference. (closes #966)
  • Handles figures within refs.
  • Adds RT tickets info to logs.
  • Fixes extra vars passing to tasks.
  • Updates stats message.
  • Updates refextract tmp filename in inveniogc.
  • Fixes -c option.
  • Forces Inspire format for tests.
  • Fixes api unit tests.
  • Fixes for printing refs.
  • Extracts collaboration into $$c subfield if CFG_INSPIRE_SITE=1. (closes #958)
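
One of the fixes above demotes a citation to misc text when its page number looks like a year (19xx or 20xx). A minimal sketch of that kind of check follows. It is an illustration only, not the refextract implementation: the names `YEAR_LIKE_PAGE`, `page_looks_like_year` and `classify_citation`, and the tuple layout they return, are hypothetical.

```python
import re

# Hypothetical pattern: a four-digit "page" that starts with 19 or 20.
YEAR_LIKE_PAGE = re.compile(r"^(19|20)\d{2}$")


def page_looks_like_year(page):
    """Return True when the page string has the shape 19xx or 20xx."""
    return YEAR_LIKE_PAGE.match(page.strip()) is not None


def classify_citation(title, volume, page):
    """Classify a parsed citation as a journal reference or as misc text.

    The return value is a tuple whose first element is 'journal' or 'misc'.
    This layout is an assumption made for the example.
    """
    if page_looks_like_year(page):
        # The "page" is probably a publication year, so keep the match
        # as free text instead of tagging it as a journal reference.
        return ("misc", "%s %s %s" % (title, volume, page))
    return ("journal", title, volume, page)
```

With this sketch, a citation such as Phys.Lett. 100B, page 117 stays a journal reference, while a match whose page is 2005 falls back to misc text. The code avoids `all` and `any`, in line with the Python 2.4 compatibility note above.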

Conflicts:
modules/docextract/lib/docextract_task.py

Details

Committed
Tibor Simko <tibor.simko@cern.ch> on Nov 27 2012, 14:46
Parents
R3600:9c44fffa48ab: DocExtract: new docextract and refextract modules

Event Timeline

Tibor Simko <tibor.simko@cern.ch> committed R3600:c7b40434a6d9: DocExtract: multiple fixes (authored by Alessio Deiana <alessio.deiana@cern.ch>) on Nov 27 2012, 14:46.