
+ Fixed a bug when looking for the end of the references section. Previously…

Authored by Nicholas Robinson <nicholas.robinson@cern.ch> on Mar 29 2007, 14:59.

Description

+ Fixed a bug when looking for the end of the references section. Previously, a regexp searched for a certain pattern of digits (something that sometimes occurs when figures/tables are converted to text). Unfortunately, in certain cases the regexp search took horribly long - infinite, maybe! It was fixed by removing the regexp pattern and using methods of the string object instead, such as replace, isdigit, etc. This also seems to have improved the recognition of the end of a reference section;

+ To identify the different pages of a PDF document, refextract looked for a page-break character (\f) on its own line, because pdftotext always put this character on a line of its own. However, in newer versions of this tool (3.01 onwards?), the character is not necessarily on its own line. This caused problems when searching for headers/footers, etc. Therefore, when refextract reads in text from pdftotext, it now moves this character onto its own line should it come at the start of a line;

+ Added a new numeration-recognition pattern (and subsequent handling code) that is used when transforming a tagged citation line into MARC XML. The new pattern looks for tagged numeration, and is applied immediately after a title + numeration pattern has been applied. This handles IBIDs that do not actually use the word "IBID". E.g.:
  <cds.TITLE>J. Phys. A</cds.TITLE> : <cds.VOL>31</cds.VOL> <cds.YR>(1998) </cds.YR> <cds.PG>2391</cds.PG>; : <cds.VOL>32</cds.VOL> <cds.YR>(1999) </cds.YR> <cds.PG>6119</cds.PG>.
  The 2nd group of numeration clearly belongs with the title - the author has simply left out the title. Previously this reference would have been missed. Now, however, it is recognised;
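The three changes can be illustrated with a rough Python sketch. This is not the refextract code: the function names, the set of separator characters stripped before the digit check, and the exact regexps are all assumptions made for illustration. Only the ideas come from the description above (string methods instead of a regexp, a page-break character moved onto its own line, a title-less numeration group inheriting the preceding title).

```python
import re

# Hypothetical helper: detect "digit debris" lines with plain string
# methods (replace / isdigit) rather than a backtracking-prone regexp.
def looks_like_digit_debris(line):
    stripped = line.strip()
    # Assumed separator set; the real code may strip different characters.
    for ch in (" ", "\t", ".", ",", "-"):
        stripped = stripped.replace(ch, "")
    # isdigit() is False for an empty string, so blank lines are not debris.
    return stripped.isdigit()

# Hypothetical helper: if a page-break character (\f) starts a line that
# also holds other text, give the \f a line of its own.
def isolate_form_feeds(lines):
    result = []
    for line in lines:
        if line.startswith("\f") and line.rstrip("\r\n") != "\f":
            result.append("\f\n")
            result.append(line[1:])
        else:
            result.append(line)
    return result

# Assumed patterns for the tagged citation line shown in the commit message.
TAGGED_TITLE = re.compile(r"<cds\.TITLE>(?P<title>[^<]+)</cds\.TITLE>")
TAGGED_NUMERATION = re.compile(
    r"<cds\.VOL>(?P<vol>[^<]+)</cds\.VOL>\s*"
    r"<cds\.YR>\((?P<year>\d{4})\)\s*</cds\.YR>\s*"
    r"<cds\.PG>(?P<page>[^<]+)</cds\.PG>"
)

# Hypothetical helper: pair every numeration group with the closest title
# to its left, so a group with no title of its own inherits the previous one.
def attach_titles(tagged_line):
    titles = [(m.start(), m.group("title"))
              for m in TAGGED_TITLE.finditer(tagged_line)]
    refs = []
    for m in TAGGED_NUMERATION.finditer(tagged_line):
        preceding = [title for pos, title in titles if pos < m.start()]
        if preceding:
            refs.append((preceding[-1], m.group("vol"),
                         m.group("year"), m.group("page")))
    return refs

EXAMPLE = (
    "<cds.TITLE>J. Phys. A</cds.TITLE> : <cds.VOL>31</cds.VOL> "
    "<cds.YR>(1998) </cds.YR> <cds.PG>2391</cds.PG>; : "
    "<cds.VOL>32</cds.VOL> <cds.YR>(1999) </cds.YR> <cds.PG>6119</cds.PG>."
)

if __name__ == "__main__":
    for ref in attach_titles(EXAMPLE):
        print(ref)
```

Run on the example line, attach_titles yields two references, both carrying the title "J. Phys. A". The real code applies its pattern immediately after a title + numeration match; the left-to-right scan here is a simplification of that.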

Event Timeline

Nicholas Robinson <nicholas.robinson@cern.ch> committed R3600:54dcf3807ddf: + Fixed a bug when looking for the end of the references section. Previously… (authored by Nicholas Robinson <nicholas.robinson@cern.ch>). Mar 29 2007, 14:59