DocExtract: several improvements

Authored by Alessio Deiana <alessio.deiana@cern.ch> on Jan 24 2013, 18:02.

Description

  • Ability to load journals from journal records database
  • Moves HTML generation of /textmining to a dedicated template file so that it can be overridden by INSPIRE
  • Reworks the removal of [] and () around report numbers, which fixes a bug that added extra characters to the misc string
  • Only unidecode strings when authors are present
  • Replaces the [A-Z] character class with a list of uppercase characters that is computed at load time
  • Adds handling for caret characters in pdf2text results
  • We can now check for existing tickets for a given recid, which means we can always create tickets for extracted references after making sure no ticket already exists
  • Adds a function to know if we can safely extract references from a record.
  • Submits a single bibupload task per refextract task run instead of one per record
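
Illustration of the uppercase-character change listed above: a minimal sketch, not the code from this commit. It assumes that "computed at load" means building the character class once at module import from Unicode category Lu, restricted here to the Basic Multilingual Plane. The names build_uppercase_class, UPPERCASE_CLASS, INITIAL_RE and find_initials are hypothetical and do not come from DocExtract.

```python
import re
import unicodedata


def build_uppercase_class():
    """Build a regex character class of all uppercase letters in the BMP.

    Assumption for this sketch: "uppercase" means Unicode category Lu.
    """
    chars = [
        chr(cp)
        for cp in range(0x10000)
        if unicodedata.category(chr(cp)) == "Lu"
    ]
    # Category Lu contains only letters, so no regex escaping is needed here.
    return "[" + "".join(chars) + "]"


# Computed once, when the module is loaded, and reused by every pattern below.
UPPERCASE_CLASS = build_uppercase_class()

# Hypothetical pattern: an author initial such as "A." or "\u00c9."
INITIAL_RE = re.compile(UPPERCASE_CLASS + r"\.")


def find_initials(text):
    """Return every uppercase letter followed by a full stop in text."""
    return INITIAL_RE.findall(text)
```

With a plain [A-Z] class, an accented initial such as "É." would be missed, while the computed class also matches non-ASCII uppercase letters.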

Signed-off-by: Alessio Deiana <alessio.deiana@cern.ch>
Tested-by: Samuele Kaplun <samuele.kaplun@cern.ch>

Details

Committed
Samuele Kaplun <samuele.kaplun@cern.ch>, Dec 18 2013, 16:21
Parents
R3600:0ba3287c6292: BibFormat: empty record check
Branches
Unknown
Tags
Unknown

Event Timeline

Samuele Kaplun <samuele.kaplun@cern.ch> committed R3600:0e5307c7bb7c: DocExtract: several improvements (authored by Alessio Deiana <alessio.deiana@cern.ch>). Dec 18 2013, 16:21