Diffusion invenio-infoscience 1b7b4c571ed3

RefExtract: improves docextract, refextract
1b7b4c571ed3
Authored by Alessio Deiana <alessio.deiana@cern.ch> on Oct 12 2012, 15:15.

Description

RefExtract: improves docextract, refextract

Bumps version to 1.5.32
Adds "0" subfield that associates recids to refs.
Improves arxiv rn parsing.
More permissive url, authors and doi regexps.
Reworks pagination removal.
Reworks journals identification.
Reworks references splitting.
Reworks refs section detection.
Handles authors with weird characters.
Fixes -c option.
Prevents having 2 "a" (author) subfields.
Improves handling of PoS.
Removes "xbook" subfield.
Adds "c" (collaboration) subfield.
Only accept 1 as first ref.
Only output o subfield if marker is present.
Tweaks find numeration.
Only creates tickets for records that are less than 2 years old.
Fixes marker detection.
Allows : and - in line markers.
Fixes find rawref: the tag number for converting to the journal format was hardcoded. This is no longer the case.
Handles ATL-CONF: handles several formats for ATLAS conferences used in references (ATL-CONF-99, ATLAS-CONF-2010-001, ...).
Fixes missing lines when a line marker with a number that does not match the sequence (1, 2, 3, etc.) is detected.
We do not strip pagination anymore. I am keeping this test in case we re-enable it.
Fixes a bug that would remove a letter from the o subfield.
arXiv report numbers with a category are not found by the search engine, so we remove the category before searching.
Adds a script to convert journal abbreviations to short form.
Adds a new record model, similar to bibrecord but easier to use.
Switches to the new bibrecord model to generate XML. This way we can generate it more easily and we can ensure it will always be valid.
Encoding tweak: we now return bytes instead of unicode from functions that provide a serialization format, namely XML.
If a reference has 2 different DOIs, we split it. We also remove duplicate DOI tags.
Prevents authors with "paper" as name.
Handles 1.1 numeration.
When unidecode produced a string with a different length than the original, the replace would be off, overwriting text to the right. We now pass the output through unidecode in that case.
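
The offset problem described in the last entry can be illustrated with a minimal sketch. It is not the refextract code: `transliterate` is a hypothetical stand-in for unidecode backed by a tiny table, and `tag_journal_naive` / `tag_journal_safe` are made-up names. The sketch assumes the match is located in the ASCII copy and then patched into the line by offset, which is one reading of the message above.

```python
# Hypothetical stand-in for unidecode: a tiny transliteration table.
# Note that "ß" becomes two characters, so the string length changes.
TRANSLIT = {"ß": "ss", "æ": "ae", "é": "e", "ö": "o"}


def transliterate(text):
    """Return an ASCII approximation of text (illustrative only)."""
    return "".join(TRANSLIT.get(ch, ch) for ch in text)


def replace_span(text, start, end, replacement):
    """Replace text[start:end] with replacement."""
    return text[:start] + replacement + text[end:]


def tag_journal_naive(line, journal, tag):
    """Locate journal in the ASCII copy, then patch the ORIGINAL line
    at the same offsets. Wrong whenever the two lengths differ."""
    ascii_line = transliterate(line)
    start = ascii_line.find(journal)
    if start == -1:
        return line
    return replace_span(line, start, start + len(journal), tag)


def tag_journal_safe(line, journal, tag):
    """Same lookup, but fall back to the transliterated line when the
    lengths differ, so the offsets always refer to the patched string."""
    ascii_line = transliterate(line)
    start = ascii_line.find(journal)
    if start == -1:
        return line
    target = line if len(ascii_line) == len(line) else ascii_line
    return replace_span(target, start, start + len(journal), tag)


line = "Groß, Phys. Rev. D 12"
print(tag_journal_naive(line, "Phys. Rev.", "<J>"))
print(tag_journal_safe(line, "Phys. Rev.", "<J>"))
```

With the sample line, the naive variant shifts the replacement one character to the right and swallows the space before "D", while the safe variant trades the original "ß" for "ss" in exchange for correct offsets.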

Signed-off-by: Alessio Deiana <alessio.deiana@cern.ch>

Details

Committed

Samuele Kaplun <samuele.kaplun@cern.ch>

Dec 18 2013, 16:21

Parents

R3600:9f5b2195c462: HepData: new HepData module

Branches

Unknown

Tags

Unknown

Event Timeline

Samuele Kaplun <samuele.kaplun@cern.ch> committed R3600:1b7b4c571ed3: RefExtract: improves docextract, refextract (authored by Alessio Deiana <alessio.deiana@cern.ch>). Dec 18 2013, 16:21

Changes (35)

				Path
	M			INSTALL
	M			configure.ac
	M			modules/bibedit/lib/bibedit_utils.py
	M			modules/docextract/bin/Makefile.am
	P			modules/docextract/bin/convert_journals.in Copied from modules/oairepository/bin/oairepositoryupdater.in
	M			modules/docextract/etc/collaborations.kb
	M			modules/docextract/etc/report-numbers.kb
	M			modules/docextract/lib/Makefile.am
	M			modules/docextract/lib/authorextract_re.py
	A			modules/docextract/lib/docextract_convert_journals.py
	A			modules/docextract/lib/docextract_convert_journals_unit_tests.py
	A			modules/docextract/lib/docextract_record.py
	A			modules/docextract/lib/docextract_record_regression_tests.py
	M			modules/docextract/lib/docextract_task.py
	M			modules/docextract/lib/docextract_text.py
	M			modules/docextract/lib/docextract_utils.py
	M			modules/docextract/lib/docextract_webinterface.py
	V			modules/docextract/lib/{docextract_webinterface_regression_tests.py ← docextract_webinterface_unit_tests.py}
	V			modules/docextract/lib/{docextract_webinterface_unit_tests.py → docextract_webinterface_regression_tests.py}
	M			modules/docextract/lib/refextract_api.py
	M			modules/docextract/lib/refextract_cli.py
	M			modules/docextract/lib/refextract_config.py
	M			modules/docextract/lib/refextract_engine.py
	M			modules/docextract/lib/refextract_find.py
	M			modules/docextract/lib/refextract_kbs.py
	M			modules/docextract/lib/refextract_linker.py
	M			modules/docextract/lib/refextract_re.py
	A			modules/docextract/lib/refextract_record.py
	M			modules/docextract/lib/refextract_regression_tests.py
	M			modules/docextract/lib/refextract_tag.py
	M			modules/docextract/lib/refextract_task.py
	M			modules/docextract/lib/refextract_text.py
	M			modules/docextract/lib/refextract_unit_tests.py
	D			modules/docextract/lib/refextract_xml.py
	M			modules/miscutil/lib/testutils.py