Diffusion invenio-infoscience 9c44fffa48ab

DocExtract: new docextract and refextract modules
9c44fffa48ab

Authored by Alessio Deiana <alessio.deiana@cern.ch> on Jul 12 2012, 11:28.

Description

Adds DocExtract as a way to easily access all text mining facilities. It will allow extracting references, authors, plots, etc. (closes #944)

Moves the refextract scripts from the bibedit module into their own module.

Adds a new API to use the refextract module. It includes:
- update_references(): updates references given a record id;
- extract_references_from_*(): extracts and parses references from a file, URL, record id or string;
- a new function that returns the MARCXML of the record with updated references;
- a new function that checks whether a record has a fulltext (PDF) attached.

Refextract filters out null characters from text converted from PDFs, as they are rejected by bibupload.
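
The null-character filtering described above can be sketched as follows. This is a minimal illustration, not the code from this commit; the function name strip_null_characters is hypothetical.

```python
def strip_null_characters(text):
    """Drop NUL characters that PDF-to-text conversion can leave behind."""
    return text.replace("\x00", "")


# Hypothetical sample of converted text containing stray NUL characters.
converted = "J.\x00 Phys. G 37\x00 (2010) 075021"
print(strip_null_characters(converted))
```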

Adds several updates to refextract parsing:
- handling of JHEP-like journals, as they need the last 2 digits of the year prepended to the volume;
- adds support for ISBNs, which are stored in a new subfield called $$i;
- adds support for references like CERN-LHCC2003-01 by transforming them to CERN-LHCC-2003-01;
- adds a new subfield <subfield code="t">Text</subfield> where refextract stores references to quoted text "Text".
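
Two of the parsing updates listed above can be illustrated with a short sketch. The function names and the regular expression are assumptions made for this example, and the JHEP helper assumes the volume arrives as an already zero-padded string; the actual refextract code may differ.

```python
import re

# Hypothetical pattern: "CERN-LHCC" immediately followed by a 4-digit year.
LHCC_RE = re.compile(r"\b(CERN-LHCC)(\d{4}-\d+)\b")


def jhep_like_volume(year, volume):
    """Prepend the last two digits of the year to the volume."""
    return str(year)[-2:] + str(volume)


def normalize_lhcc(text):
    """Insert the missing dash in report numbers such as CERN-LHCC2003-01."""
    return LHCC_RE.sub(r"\1-\2", text)


print(jhep_like_volume(2007, "05"))
print(normalize_lhcc("see CERN-LHCC2003-01 for details"))
```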

Adds a new option to the bibtask mode of refextract, "--no-overwrite", which checks each record for existing references before parsing it. If the record already has references, it skips it.

Fixes recent records detection:
- only stores last_updated when running on recent records. This prevents parsing the most recent reference via --recids n from updating the last_updated field and causing refextract to skip all references preceding n;
- only updates last_id and last_updated when, respectively, the new id is bigger and the new last_updated is more recent. This prevents storing an old date when parsing old records.
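
The bookkeeping rules above can be sketched as a small pure function. The state layout, the function name update_state and the recent_mode flag are assumptions for illustration; the sketch gates only last_updated on the recent-records mode, following the wording of the first rule literally.

```python
from datetime import datetime


def update_state(state, record_id, modified, recent_mode):
    """Return a new state in which the stored values only ever move forward."""
    new_state = dict(state)
    # last_id only grows.
    if record_id > state["last_id"]:
        new_state["last_id"] = record_id
    # last_updated is touched only in recent-records mode, and only moves forward.
    if recent_mode and modified > state["last_updated"]:
        new_state["last_updated"] = modified
    return new_state


state = {"last_id": 10, "last_updated": datetime(2012, 7, 1)}
# Running on an explicit, older record leaves the state untouched.
print(update_state(state, 5, datetime(2012, 6, 1), recent_mode=False) == state)
```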

Handles the format arXiv:9910.1234 [physics.ins-det].
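
As an illustration of the format above, a regular expression along these lines would pick out the identifier and the bracketed category. The pattern is an assumption written for this example and is not taken from refextract.

```python
import re

# Hypothetical pattern: "arXiv:" + NNNN.NNNN(N), optionally followed by "[category]".
ARXIV_RE = re.compile(r"arXiv:(\d{4}\.\d{4,5})\s*(?:\[([A-Za-z.\-]+)\])?")


def parse_arxiv(text):
    """Return (identifier, category) or None; category may be None."""
    match = ARXIV_RE.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_arxiv("arXiv:9910.1234 [physics.ins-det]"))
```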

Fixes numeration checking when looking for the end of references.

Reworks xbook as a single tag: xbook used to store the book title; the title is now always stored in $$t.

New authors recognized:
- Figuera-O'Farrill
- P. Pre'
- Dan V. Schroeder

Adds 9+ and w+ to the report numbers format.

Handles Sci.Eng. 450(1-3), 3, 2007 (no space after volume).

Handles PoS LAT2007 (2007) 12 journal.

Handles report numbers like CERN/LHCC/98-013.

Handles urls like http://server/?q=1&w=2.

Handles C67:674,1998 numeration.
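
A numeration of this shape could be parsed roughly as below. This is a sketch only: the pattern and the reading of the leading letter as a series letter are assumptions, not the expressions used by refextract.

```python
import re

# Hypothetical pattern: optional series letter, volume, ":", page, ",", year.
NUMERATION_RE = re.compile(
    r"(?P<series>[A-Z])?(?P<volume>\d+):(?P<page>\d+),(?P<year>\d{4})$"
)


def parse_numeration(text):
    """Return the numeration parts as a dict, or None if the text does not match."""
    match = NUMERATION_RE.match(text.strip())
    return match.groupdict() if match else None


print(parse_numeration("C67:674,1998"))
```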

Adds a new way to recognize journals, which is needed when recognizing short titles. Often the short title or initials of a journal conflict with other names, e.g. DAN (the journal) and Dan (a common first name). We handle this via precise regular expressions.
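
To illustrate the idea, a "precise" expression might require the exact upper-case spelling and a following volume number. The pattern below is purely hypothetical; the real entries in journal-titles-re.kb are not shown in this commit message.

```python
import re

# Hypothetical rule: case-sensitive, whole-word "DAN" followed by a number.
DAN_JOURNAL_RE = re.compile(r"\bDAN\b(?=\s+\d)")


def mentions_dan_journal(line):
    """True when the line looks like a citation of the journal, not the first name."""
    return DAN_JOURNAL_RE.search(line) is not None


print(mentions_dan_journal("DAN 123 (1958) 45"))
print(mentions_dan_journal("Dan V. Schroeder, Thermal Physics"))
```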

Match Acknowldgment and Acknowledgment as end of sections.

Format hep report numbers to hep-th/999999.

Recognizes roman numbers as volume numbers.

Removes [] and () from o subfield.

Removes extra spaces at the end of lines.

Does not try to detect C and D for roman numbers. It would result in some series letters being detected instead.
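
A sketch of roman-numeral volume detection that leaves out C and D could look like the following. Restricting the alphabet to I, V, X and L (so also excluding M) is an assumption made for this example, as are all names.

```python
import re

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50}
# Well-formed numerals from 1 to 89, built only from I, V, X and L.
ROMAN_RE = re.compile(r"^(?=[IVXL])(XL|L?X{0,3})(IX|IV|V?I{0,3})$")


def roman_volume_to_int(token):
    """Convert a roman volume such as 'XIV' to an int, or return None."""
    if not ROMAN_RE.match(token):
        return None
    total = 0
    previous = 0
    # Walk right to left: a smaller value before a larger one is subtracted.
    for char in reversed(token):
        value = ROMAN_VALUES[char]
        if value < previous:
            total -= value
        else:
            total += value
            previous = value
    return total


print(roman_volume_to_int("XIV"))
print(roman_volume_to_int("C"))
```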

Does not detect "B, 07" volumes anymore, since some of these come from different journals, e.g. Phys.Rev. and Phys.Rev.B.

Format hep-ex report numbers.

Tweaks how the beginning and the end of the references sections are found.

Allows dashes as separators for numeration.

REST api to run refextract.

Defaults to inspire format on CLI when running on an inspire site.

Handles journals with the series included in the title.

Introduces a separator in journals kb: Phys.Rev.B maps to Phys.Rev.;B.

Handles Phys.Rev.;B by splitting the B from the journal title and adding it in front of the volume.
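
The separator handling described above can be sketched like this. The function name and argument layout are assumptions for illustration only.

```python
def split_series(kb_title, volume):
    """Split a 'Title;Series' kb value and put the series in front of the volume."""
    title, separator, series = kb_title.partition(";")
    if separator:
        volume = series + volume
    return title, volume


print(split_series("Phys.Rev.;B", "76"))
```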

Repackages docextract and refextract in one directory.

Search hook for searching from a reference.

Updates binaries to use template.in for custom python binaries paths.

Splits daemon functionality which remains in refextract and cli functionality which is moved to docextract.

Recognizes publishers.

Removes JINST from special journals.

Moves special journals kb to a file.

Allows extracting references from an arXiv id.

kbs loading optimization: they are now cached in memory after being loaded.
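
In-memory caching of this kind can be sketched with a module-level dictionary. The names get_kb and fake_loader are hypothetical, and the loader stands in for whatever actually reads a kb file from disk.

```python
_KB_CACHE = {}


def get_kb(name, loader):
    """Load a knowledge base once and serve it from memory afterwards."""
    if name not in _KB_CACHE:
        _KB_CACHE[name] = loader(name)
    return _KB_CACHE[name]


calls = []


def fake_loader(name):
    # Stand-in for reading and parsing a kb file; records each real load.
    calls.append(name)
    return {"Phys. Rev.": "Phys.Rev."}


get_kb("journal-titles", fake_loader)
get_kb("journal-titles", fake_loader)
print(len(calls))
```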

Create RT tickets after extracting references.

Fixes footer removal when references section contains ")".

Escapes ibid authors for XML (unescaped values were causing bibupload tasks to fail).

Handles erratum-ibid (closes #1014).

Transforms hep-lat-9999 to hep-lat/9999 and astro-php-09 to astro-ph/09.
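
The two transformations above could be expressed with a single substitution, as in this sketch. The pattern is an assumption; it takes the astro-php-09 spelling in the message at face value and covers only the archive names listed in the expression.

```python
import re

# Hypothetical pattern: archive name, optional stray "p", dash, digits.
PREPRINT_RE = re.compile(r"\b(hep-(?:th|ph|ex|lat)|astro-ph)p?-(\d+)")


def normalize_preprint(text):
    """Rewrite 'archive-digits' as 'archive/digits'."""
    return PREPRINT_RE.sub(r"\1/\2", text)


print(normalize_preprint("hep-lat-9999"))
print(normalize_preprint("astro-php-09"))
```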

arXiv papers can have several revisions over the first week, and curation of these papers is delayed by that one week. As a result, we decided to re-extract references when an arXiv record is modified during its first week.
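
The first-week rule can be sketched as a simple date comparison. The function name and the use of creation and modification timestamps are assumptions for this example.

```python
from datetime import datetime, timedelta

FIRST_WEEK = timedelta(days=7)


def should_reextract(created, modified):
    """True when the record was modified within a week of its creation."""
    return created < modified <= created + FIRST_WEEK


created = datetime(2012, 7, 2, 9, 0)
print(should_reextract(created, datetime(2012, 7, 5, 9, 0)))
print(should_reextract(created, datetime(2012, 7, 20, 9, 0)))
```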

Details

Committed

Tibor Simko <tibor.simko@cern.ch>

Nov 27 2012, 14:44

Parents

R3600:47f716489604: Merge branch 'maint-1.1'

Branches

Unknown

Tags

Unknown

Event Timeline

Tibor Simko <tibor.simko@cern.ch> committed R3600:9c44fffa48ab: DocExtract: new docextract and refextract modules (authored by Alessio Deiana <alessio.deiana@cern.ch>).
Nov 27 2012, 14:44

Changes (89)

		Path
M		.gitignore
M		config/invenio.conf
M		configure.ac
M		modules/Makefile.am
A	(dir)	modules/docextract/
P		modules/docextract/Makefile.am Copied from modules/bibupload/doc/Makefile.am
A	(dir)	modules/docextract/bin/
P		modules/docextract/bin/Makefile.am Copied from modules/bibedit/bin/Makefile.am
P		modules/docextract/bin/docextract.in Copied from modules/refextract/bin/refextract.in
V		modules/{docextract ← refextract}/bin/refextract.in
A	(dir)	modules/docextract/doc/
P		modules/docextract/doc/Makefile.am Copied from modules/bibedit/doc/Makefile.am
A	(dir)	modules/docextract/doc/admin/
P		modules/docextract/doc/admin/Makefile.am Copied from modules/bibrank/doc/admin/Makefile.am
P		modules/docextract/doc/admin/docextract-admin-guide.webdoc Copied from modules/bibauthorid/lib/__init__.py
A	(dir)	modules/docextract/doc/hacking/
P		modules/docextract/doc/hacking/Makefile.am Copied from modules/webstyle/doc/hacking/Makefile.am
P		modules/docextract/doc/hacking/docextract-internals.webdoc Copied from modules/bibformat/lib/elements/test3.py
A	(dir)	modules/docextract/etc/
V		modules/{docextract ← refextract}/etc/Makefile.am
A		modules/docextract/etc/authors.kb
A		modules/docextract/etc/books.kb
V		modules/{docextract/etc/collaborations.kb ← refextract/etc/refextract-authors.kb}
P		modules/docextract/etc/conferences.kb Copied from modules/bibauthorid/lib/__init__.py
A		modules/docextract/etc/example.pdf
A		modules/docextract/etc/example.txt
V		modules/{docextract/etc/job-preprints.cfg ← refextract/etc/refextract-job-preprints.cfg}
A		modules/docextract/etc/journal-titles-re.kb
V		modules/{docextract/etc/journal-titles.kb ← refextract/etc/refextract-journal-titles.kb}
A		modules/docextract/etc/publishers.kb
V		modules/{docextract/etc/report-numbers.kb ← refextract/etc/refextract-report-numbers.kb}
A		modules/docextract/etc/special-journals.kb
A		modules/docextract/etc/test1.txt
A		modules/docextract/etc/test2.txt
A	(dir)	modules/docextract/lib/
V		modules/{docextract ← refextract}/lib/Makefile.am
A		modules/docextract/lib/authorextract_re.py
A		modules/docextract/lib/docextract_pdf.py
A		modules/docextract/lib/docextract_task.py
A		modules/docextract/lib/docextract_text.py
P		modules/docextract/lib/docextract_utils.py Copied from modules/oaiharvest/web/admin/Makefile.am
A		modules/docextract/lib/docextract_webinterface.py
A		modules/docextract/lib/docextract_webinterface_tests.py
A		modules/docextract/lib/refextract_api.py
A		modules/docextract/lib/refextract_api_tests.py
A		modules/docextract/lib/refextract_cli.py
A		modules/docextract/lib/refextract_config.py
A		modules/docextract/lib/refextract_engine.py
A		modules/docextract/lib/refextract_find.py
A		modules/docextract/lib/refextract_kbs.py
A		modules/docextract/lib/refextract_re.py
A		modules/docextract/lib/refextract_regression_tests.py
A		modules/docextract/lib/refextract_tag.py
A		modules/docextract/lib/refextract_task.py
A		modules/docextract/lib/refextract_tests.py
A		modules/docextract/lib/refextract_text.py
A		modules/docextract/lib/refextract_xml.py
M		modules/miscutil/lib/inveniocfg.py
M		modules/miscutil/sql/tabcreate.sql
M		modules/miscutil/sql/tabfill.sql
M		modules/oaiharvest/lib/oai_harvest_daemon.py
D		modules/refextract
D		modules/refextract/Makefile.am
D		modules/refextract/bin
D		modules/refextract/bin/Makefile.am
P		modules/refextract/bin/refextract.in Deleted after being copied to multiple locations: modules/docextract/bin/docextract.in modules/docextract/bin/refextract.in
D		modules/refextract/doc
D		modules/refextract/doc/Makefile.am
D		modules/refextract/doc/admin
D		modules/refextract/doc/admin/Makefile.am
D		modules/refextract/doc/hacking
D		modules/refextract/doc/hacking/Makefile.am
D		modules/refextract/etc
V		modules/{refextract → docextract}/etc/Makefile.am
V		modules/{refextract/etc/refextract-authors.kb → docextract/etc/collaborations.kb}
V		modules/{refextract/etc/refextract-job-preprints.cfg → docextract/etc/job-preprints.cfg}
V		modules/{refextract/etc/refextract-journal-titles.kb → docextract/etc/journal-titles.kb}
V		modules/{refextract/etc/refextract-report-numbers.kb → docextract/etc/report-numbers.kb}
D		modules/refextract/lib
V		modules/{refextract → docextract}/lib/Makefile.am
D		modules/refextract/lib/refextract.py
D		modules/refextract/lib/refextract_authextract_unit_tests.py
D		modules/refextract/lib/refextract_cli.py
D		modules/refextract/lib/refextract_config.py
D		modules/refextract/lib/refextract_daemon.py
D		modules/refextract/lib/refextract_unit_tests.py
M		modules/websearch/lib/search_engine.py
M		modules/websearch/lib/search_engine_query_parser.py
M		modules/webstyle/lib/webinterface_layout.py