DocExtract: new docextract and refextract modules
- Adds DocExtract as a way to easily access all text mining facilities. It will allow extracting references, authors, plots, etc. (closes #944)
- Moves the refextract scripts from the bibedit module into their own module.
- Adds a new API to use the refextract module. It includes calls to:
  - update_references(): update references by passing a record id;
  - extract_references_from_*(): extract and parse references from file/url/record id/string;
  - new function that returns the marcxml of the record with updated references;
  - new function to check if a record has a fulltext (pdf) attached.
- Refextract filters out null characters from text converted from PDFs, as bibupload rejects them.
- Adds several updates to refextract parsing:
  - handles JHEP-like journals, which need the last 2 digits of the year prepended to the volume;
  - adds support for ISBNs, which are stored in a new subfield called $$i;
  - adds support for references like CERN-LHCC2003-01 by transforming them to CERN-LHCC-2003-01;
  - adds a new subfield <subfield code="t">Text</subfield> where refextract stores text quoted in a reference ("Text").
- Adds a new option to the bibtask mode of refextract, "--no-overwrite", which checks each record for existing references before parsing it. Records that already have references are skipped.
- Fixes recent records detection:
  - only stores last_updated when running on recent records. Otherwise, parsing the most recent record via --recids n would update the last_updated field and make refextract skip all records preceding n;
  - only updates last_id and last_updated when, respectively, the new id is bigger and the new last_updated is more recent. This prevents storing an old date when parsing old records.
- Handles the format arXiv:9910.1234 [physics.ins-det].
- Fixes numeration checking when looking for the end of references.
- Reworks xbook as a single tag: xbook used to store the book title; the title is now always stored in $$t.
- Recognizes new author names:
  - Figuera-O'Farrill
  - P. Pre'
  - Dan V. Schroeder
- Adds 9+ and w+ to report number formats.
- Handles Sci.Eng. 450(1-3), 3, 2007 (no space after volume).
- Handles PoS LAT2007 (2007) 12 journal.
- Handles report numbers like CERN/LHCC/98-013.
- Handles URLs like http://server/?q=1&w=2.
- Handles C67:674,1998 numeration.
- Adds a new way to recognize journals, needed when recognizing short titles. The short title or initials of a journal often conflict with other names, e.g. DAN (the journal) and Dan (a common first name). This is handled via precise regular expressions.
- Matches both Acknowledgement and Acknowledgment as the end of a section.
- Formats hep report numbers to hep-th/999999.
- Recognizes roman numerals as volume numbers.
- Removes [] and () from the o subfield.
- Removes extra spaces at the end of lines.
- Does not try to detect C and D as roman numerals, as this would result in some series letters being detected instead.
- No longer detects "B, 07" volumes, since some of these come from journals that are distinct, such as Phys.Rev. and Phys.Rev.B.
- Formats hep-ex report numbers.
- Tweaks how the beginning and the end of the references sections are found.
- Allows dashes as separators for numeration.
- Adds a REST API to run refextract.
- Defaults to inspire format on CLI when running on an inspire site.
- Handles journals with the series included in the title.
- Introduces a separator in journals kb: Phys.Rev.B maps to Phys.Rev.;B.
- Handles Phys.Rev.;B by splitting the B from the journal title and adding it in front of the volume.
- Repackages docextract and refextract in one directory.
- Adds a search hook for searching from a reference.
- Updates binaries to use template.in for custom python binaries paths.
- Splits daemon functionality, which remains in refextract, from CLI functionality, which is moved to docextract.
- Recognizes publishers.
- Removes JINST from special journals.
- Moves special journals kb to a file.
- Allows extracting references from an arXiv id.
- Optimizes kb loading: kbs are now cached in memory after being loaded.
- Creates RT tickets after extracting references.
- Fixes footer removal when references section contains ")".
- Escapes ibid authors for XML (unescaped authors were causing bibupload tasks to fail).
- Handles erratum-ibid (closes #1014)
- Transforms hep-lat-9999 to hep-lat/9999 and astro-php-09 to astro-ph/09.
- arXiv papers can have several revisions over their first week, and curation of these papers is delayed by that one week. As a result, references are re-extracted when an arXiv record is modified during its first week.
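
For illustration only, three of the normalization rules listed above (report number reformatting, JHEP-like volumes, and the ";" series separator in the journals kb) can be sketched as follows. This is a minimal sketch under stated assumptions, not the refextract implementation: the function names, signatures, and regular expressions are hypothetical and only reproduce the input/output examples quoted in this message.

```python
import re


def normalize_report_number(text):
    """Hypothetical helper: reformat the report numbers quoted above."""
    # CERN-LHCC2003-01 -> CERN-LHCC-2003-01 (insert the missing dash)
    text = re.sub(r'\b(CERN-LHCC)(\d{4}-\d+)', r'\1-\2', text)
    # hep-lat-9999 -> hep-lat/9999 (dash before the number becomes a slash)
    text = re.sub(r'\b(hep-(?:th|ph|ex|lat))-(\d+)', r'\1/\2', text)
    return text


def jhep_volume(year, volume):
    """Hypothetical helper: prepend the last 2 digits of the year to the volume."""
    return "%02d" % (int(year) % 100) + volume


def split_series(title, volume, separator=";"):
    """Hypothetical helper: move a series letter from the kb title to the volume.

    A kb entry such as "Phys.Rev.;B" is split at the separator and the
    series letter is placed in front of the volume.
    """
    if separator in title:
        title, series = title.split(separator, 1)
        volume = series + volume
    return title, volume


print(normalize_report_number("CERN-LHCC2003-01"))  # CERN-LHCC-2003-01
print(normalize_report_number("hep-lat-9999"))      # hep-lat/9999
print(jhep_volume(2007, "05"))                      # 0705
print(split_series("Phys.Rev.;B", "76"))            # ('Phys.Rev.', 'B76')
```

Inputs that are already in the target form, such as hep-th/9901001 or a journal title without a separator, pass through these helpers unchanged.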