
DocExtract: new docextract and refextract modules

Authored by Alessio Deiana <alessio.deiana@cern.ch> on Jul 12 2012, 11:28.


  • Adds DocExtract as a way to easily access all text mining facilities. It will allow extracting references, authors, plots, etc. (closes #944)
  • Moves the refextract scripts from the bibedit module into their own module.
  • Adds a new API to use the refextract module. It includes calls to:
    • update_references(): update references by passing a record id;
    • extract_references_from_*(): extract and parse references from file/url/record id/string;
    • new function that returns the marcxml of the record with updated references;
    • new function to check if a record has a fulltext (pdf) attached.
  • Refextract filters out null characters from PDF-converted text, as they are refused by bibupload.
  • Adds several updates to refextract parsing:
    • handling of JHEP-like journals, as they need the last 2 digits of the year prepended to the volume;
    • adds support for ISBNs, which are stored in a new subfield called $$i;
    • adds support for references like CERN-LHCC2003-01 by transforming them to CERN-LHCC-2003-01;
    • adds a new subfield <subfield code="t">Text</subfield> where refextract stores references to quoted text "Text".
  • Adds a new option to the bibtask mode of refextract, "--no-overwrite", which checks each record for existing references before parsing it. If the record already has references, it skips it.
  • Fixes recent records detection:
    • only stores last_updated when running on recent records. This prevents a run that parses the most recent reference via --recids n from updating the last_updated field and making refextract skip all references preceding n;
    • only updates last_id and last_updated when, respectively, the new id is bigger and the new last_updated is more recent. This prevents storing an old date when parsing old records.
  • Handles the format arXiv:9910.1234 [physics.ins-det].
  • Fixes numeration checking when looking for the end of references.
  • Reworks xbook as a single tag: xbook was storing the book title, instead the title is always stored in $$t.
  • New authors recognized:
    • Figuera-O'Farrill
    • P. Pre'
    • Dan V. Schroeder
  • Adds 9+ and w+ to report numbers format.
  • Handles Sci.Eng. 450(1-3), 3, 2007 (no space after volume).
  • Handles PoS LAT2007 (2007) 12 journal.
  • Handles report numbers like CERN/LHCC/98-013.
  • Handles C67:674,1998 numeration.
  • Adds a new way to recognize journals, which is needed when we recognize short titles. Often the short titles or initials of a journal conflict with other names, e.g. DAN (the journal) and Dan (a common first name). We handle this via precise regular expressions.
  • Match Acknowldgment and Acknowledgment as end of sections.
  • Format hep report numbers to hep-th/999999.
  • Recognizes Roman numerals as volume numbers.
  • Removes [] and () from o subfield.
  • Removes extra spaces at the end of lines.
  • Does not try to detect C and D for Roman numerals. It would result in some series letters being detected instead.
  • Does not detect "B, 07" volumes anymore, since some of these are from journals that differ, e.g. Phys.Rev. and Phys.Rev.B.
  • Format hep-ex report numbers.
  • Tweaks how the beginning and the end of the references sections are found.
  • Allows dashes as separators for numeration.
  • REST api to run refextract.
  • Defaults to inspire format on CLI when running on an inspire site.
  • Handles journals with series included in title.
  • Introduces a separator in journals kb: Phys.Rev.B maps to Phys.Rev.;B.
  • Handles Phys.Rev.;B by splitting the B from the journal title and adding it in front of the volume.
  • Repackages docextract and refextract in one directory.
  • Search hook for searching from a reference.
  • Updates binaries to use template.in for custom python binaries paths.
  • Splits daemon functionality which remains in refextract and cli functionality which is moved to docextract.
  • Recognizes publishers.
  • Removes JINST from special journals.
  • Moves special journals kb to a file.
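
The two journal numeration rules described above (the year prefix for JHEP-like journals and the ";" separator in the journals kb) can be sketched as follows. This is only an illustration, not the refextract code: the function names are hypothetical, and zero-padding the volume to two digits is an assumption.

```python
def jhep_volume(volume, year):
    """Prepend the last two digits of the year to a JHEP-like volume.

    Zero-padding the volume to two digits is an assumption of this sketch.
    """
    return year[-2:] + volume.zfill(2)


def split_series(kb_title, volume):
    """Move a series letter stored after ';' in front of the volume.

    kb_title is the value the journals kb maps to, e.g. "Phys.Rev.;B".
    Titles without a separator are returned unchanged.
    """
    if ";" not in kb_title:
        return kb_title, volume
    title, series = kb_title.split(";", 1)
    return title, series + volume


if __name__ == "__main__":
    print(jhep_volume("10", "2008"))
    print(split_series("Phys.Rev.;B", "76"))
```

With these helpers, a kb value of Phys.Rev.;B and a volume of 76 would yield the title Phys.Rev. and the volume B76.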


  • Allows extracting references from an arXiv id.
  • kbs loading optimization: knowledge bases are now cached in memory after being loaded.
  • Create RT tickets after extracting references.
  • Fixes footer removal when references section contains ")".
  • Escape ibid authors for xml (was leading to failed bibupload tasks).
  • Handle erratum-ibid (closes #1014).
  • Transforms hep-lat-9999 to hep-lat/9999 and astro-php-09 to astro-ph/09.
  • arXiv papers can have several revisions over their first week, and curation of these papers is delayed by that one week. As a result, we decided to re-extract references when an arXiv record is modified in its first week.
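
Two of the report number transformations listed in this commit (hep-lat-9999 to hep-lat/9999 and CERN-LHCC2003-01 to CERN-LHCC-2003-01) could look roughly like the sketch below. The regular expressions and the function name are assumptions made for illustration; the patterns actually used by refextract may differ.

```python
import re

# hep-lat-9999 -> hep-lat/9999 (the list of hep categories is an assumption)
HEP_DASH = re.compile(r"\b(hep-(?:th|ph|ex|lat))-(\d+)\b")

# CERN-LHCC2003-01 -> CERN-LHCC-2003-01 (missing dash before the year)
CERN_LHCC = re.compile(r"\bCERN-LHCC(\d{4})-(\d+)\b")


def normalize_report_number(text):
    """Rewrite the two report number variants into their canonical form."""
    text = HEP_DASH.sub(r"\1/\2", text)
    text = CERN_LHCC.sub(r"CERN-LHCC-\1-\2", text)
    return text


if __name__ == "__main__":
    print(normalize_report_number("hep-lat-9999"))
    print(normalize_report_number("CERN-LHCC2003-01"))
```

Report numbers that are already in canonical form pass through unchanged, so the function can be applied more than once without side effects.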


Tibor Simko <tibor.simko@cern.ch> Nov 27 2012, 14:44
R3600:47f716489604: Merge branch 'maint-1.1'

Event Timeline

Tibor Simko <tibor.simko@cern.ch> committed R3600:9c44fffa48ab: DocExtract: new docextract and refextract modules (authored by Alessio Deiana <alessio.deiana@cern.ch>). Nov 27 2012, 14:44