DocExtract: several improvements
- Adds the ability to load journals from the journal records database
- Moves HTML generation of /textmining to a dedicated template file so that it can be overridden by INSPIRE
- Reworks the removal of [] and () around report numbers, which fixes a bug that added extra characters to the misc string
- Only unidecode strings when authors are present
- Replaces the [A-Z] character class with a list of uppercase characters that is computed at load time
- Adds handling for caret in pdf2text results
- We can now check for existing tickets for a given recid, which means we can always create tickets for extracted references after making sure there is no existing ticket
- Adds a function to check whether we can safely extract references from a record
- Submits a single bibupload task per refextract task run instead of one per record
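Note on the uppercase character item: the idea can be illustrated with the minimal sketch below. It is not the code from this commit. It assumes Python 3 and the standard unicodedata module, and the names compute_uppercase_chars, UPPERCASE_CHARS, UPPERCASE_CLASS and initial_re are hypothetical.

```python
import re
import sys
import unicodedata


def compute_uppercase_chars():
    """Return a string of every code point in Unicode category "Lu"."""
    return "".join(
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Lu"
    )


# Computed once, when the module is loaded (hypothetical names).
UPPERCASE_CHARS = compute_uppercase_chars()

# Drop-in replacement for a literal "[A-Z]" inside a larger pattern.
UPPERCASE_CLASS = "[" + re.escape(UPPERCASE_CHARS) + "]"

# Example pattern: an author initial such as "A." or "É."
initial_re = re.compile(UPPERCASE_CLASS + r"\.")
```

With such a class, an accented initial like "É." matches where a plain [A-Z] would not, at the cost of a one-time scan of the Unicode range at import.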
Signed-off-by: Alessio Deiana <alessio.deiana@cern.ch>
Tested-by: Samuele Kaplun <samuele.kaplun@cern.ch>