RefExtract: improves docextract, refextract
- Bumps version to 1.5.32
- Adds "0" subfield that associates recids to refs.
- Improves arxiv rn parsing.
- More permissive url, authors and doi regexps.
- Reworks pagination removal.
- Reworks journals identification.
- Reworks references splitting.
- Reworks refs section detection.
- Handles authors with weird characters.
- Fixes -c option.
- Prevents having 2 "a" (author) subfields.
- Improves handling of PoS.
- Removes "xbook" subfield.
- Adds "c" (collaboration) subfield.
- Only accept 1 as first ref.
- Only output o subfield if marker is present.
- Tweaks find numeration.
- Only create tickets for records that are less than 2 years old
- Fixes marker detection.
- Allows : and - in line markers.
- Fixes find rawref: the tag number for converting to the journal format was hardcoded. This is no longer the case.
- Handles ATL-CONF: handles several formats for ATLAS conferences used in references (ATL-CONF-99, ATLAS-CONF-2010-001, ...).
- Fixes missing lines when a line marker with a number that does not match the sequence (1, 2, 3, etc.) is detected.
- We do not strip pagination anymore. I am keeping this test in case we re-enable it.
- Fixes bug that would remove letter from o subfield
- arXiv report numbers with a category are not found by the search engine so we remove the category before searching.
- Script to convert journal abbreviations to short form.
- New record model similar to bibrecord but easier to use
- Switches to the new bibrecord model to generate xml. This way we can generate it more easily and we can ensure it will always be valid.
- Encoding tweaking: we now return bytes instead of unicode for functions that provide a serialization format, namely xml.
- If a reference has 2 different DOIs, we split it. We also remove duplicate DOI tags.
- Prevent authors with "paper" as name
- Handles 1.1 numeration
- When unidecode produced a string with a different length than the original, the replace would be off, overwriting text to the right. We now pass the output through unidecode in that case.
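
As an illustration of the arXiv item above (removing the category before searching), here is a minimal sketch. It is not the code in this commit: the helper name strip_arxiv_category is hypothetical, and it assumes the category is written as a trailing bracketed suffix such as "arXiv:1001.1234 [hep-th]".

```python
import re

# Assumption: the category appears as a trailing "[category]" suffix.
# The format actually produced by refextract may differ.
CATEGORY_SUFFIX = re.compile(r"\s*\[[A-Za-z][A-Za-z.\-]*\]\s*$")


def strip_arxiv_category(report_number):
    """Return report_number without a trailing "[category]" suffix.

    Report numbers that carry no such suffix are returned unchanged.
    """
    return CATEGORY_SUFFIX.sub("", report_number)


if __name__ == "__main__":
    for rn in ("arXiv:1001.1234 [hep-th]", "arXiv:1001.1234", "hep-th/9901001"):
        print(rn, "->", strip_arxiv_category(rn))
```
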
Signed-off-by: Alessio Deiana <alessio.deiana@cern.ch>