Diffusion invenio-infoscience 103aef0217c6

refextract: author identification
103aef0217c6
Authored by Christopher Hayward <christopher.james.hayward@cern.ch> on Oct 11 2010, 10:46.

Description

refextract: author identification

Identifies authors in citations. Splits references based on the number of author groups found and the presence of semi-colons. Completely refactors how the output MARC-XML is created and removes many redundant methods.

Authors are identified as 'groups' of authors within citations. Multiple groups may indicate that a reference is actually two citations. Accurate classification of author names relies on an extensive titles knowledge base, since text already tagged as a title will not be tagged as an author afterwards. Identified authors also help to pick out useful semi-colons, since author tagging limits the text that is dumped into the misc subfield. Author groups can include words such as 'and' and 'et al'. Also, if an 'and' is located at the start of an author group, a weaker author pattern is applied to the preceding misc text, which is likely to hold an author that was not correctly matched.
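The splitting idea described above can be illustrated with a minimal Python 3 sketch. It is not the code from refextract.py: the regular expressions, the function names count_author_groups and split_reference, and the rule "split at a semi-colon only when the following segment contains an author group" are all assumptions made for illustration, and the author pattern is far weaker than anything a real knowledge-base-driven tagger would use.

```python
import re

# Hypothetical, deliberately simple author pattern (NOT refextract's own):
# matches "J. Smith" or "Smith, J." style names.
INITIALS = r"(?:[A-Z]\.\s?){1,3}"
SURNAME = r"[A-Z][a-z]+(?:-[A-Z][a-z]+)?"
AUTHOR = rf"(?:{INITIALS}\s?{SURNAME}|{SURNAME},?\s{INITIALS})"

# An author group: one or more authors joined by ',', 'and' or '&',
# optionally followed by 'et al'.
AUTHOR_GROUP = re.compile(
    rf"{AUTHOR}(?:\s*(?:,|and|&)\s*{AUTHOR})*(?:\s*,?\s*et al\.?)?"
)


def count_author_groups(line):
    """Return the number of author groups found in a reference line."""
    return sum(1 for _ in AUTHOR_GROUP.finditer(line))


def split_reference(line):
    """Split a reference line at semi-colons that precede an author group.

    A semi-colon followed by text with no author group is treated as
    ordinary punctuation and the text stays with the previous citation.
    """
    parts = [p.strip() for p in line.split(";")]
    citations = [parts[0]]
    for part in parts[1:]:
        if AUTHOR_GROUP.search(part):
            citations.append(part)          # new citation starts here
        else:
            citations[-1] += "; " + part    # keep with previous citation
    return citations


two = ("J. Smith and A. Jones, Phys. Rev. D 10 (1974) 123; "
       "B. Brown et al., Nucl. Phys. B 5 (1980) 45")
one = "J. Smith, Phys. Rev. D 10 (1974) 123; erratum ibid. 11 (1975) 456"
print(count_author_groups(two), split_reference(two))
print(count_author_groups(one), split_reference(one))
```

In this sketch the second line is kept as a single citation because nothing after its semi-colon looks like an author, which mirrors the point that identified authors help to decide which semi-colons are useful.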

On top of this, authors that look like editors (those with an 'ed' phrase somewhere near the author, in some format) are not tagged as authors, since they do not indicate multiple citations inside a single reference line.
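A check of this kind might look like the following Python 3 sketch. The marker list, the function name looks_like_editor and the 12-character window are assumptions chosen for illustration; the commit message does not say which 'ed' formats are recognised or how far from the author they may appear.

```python
import re

# Hypothetical editor markers: "ed.", "eds.", "edited by", "editor(s)".
# The leading \b keeps words such as "Published." from matching.
EDITOR_MARK = re.compile(r"\b(?:eds?\.|edited\s+by|editors?\b)", re.IGNORECASE)


def looks_like_editor(line, start, end, window=12):
    """Return True if an editor marker sits close to line[start:end].

    start and end delimit an already identified author group; window is
    the number of characters inspected on each side (an assumed value).
    """
    before = line[max(0, start - window):start]
    after = line[end:end + window]
    return bool(EDITOR_MARK.search(before) or EDITOR_MARK.search(after))


line = "in Quantum Fields, ed. A. Smith (Springer, 1990)"
start = line.index("A. Smith")
print(looks_like_editor(line, start, start + len("A. Smith")))
```

A caller would skip author tagging for any group where this returns True, so that an editor name does not count towards the number of author groups in the line.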

The methods which control the conversion of a tagged reference line to MARC-XML have been completely rewritten, following the same branches of execution as the previous methods. However, the new methods are not only much easier to understand, but also take into consideration all of the tagged elements in a citation line when deciding whether to split a reference line into two or more citations. The management of IBIDs has also been simplified, by attaching a list of IBID dictionaries to the starting title they apply to.
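The IBID bookkeeping described above can be pictured with a small Python 3 sketch. The element layout (dictionaries with a 'type' key and 'volume', 'year' and 'page' fields), the 'ibids' key and the function name attach_ibids are hypothetical; the commit message only states that a list of IBID dictionaries is attached to the title they refer back to.

```python
def attach_ibids(elements):
    """Attach each IBID element to the most recent preceding title.

    elements is a list of dicts describing tagged pieces of one reference
    line, in order. Titles are copied with an extra 'ibids' list; IBIDs
    that follow a title are moved into that list instead of being kept as
    separate elements. The input list is not modified.
    """
    result = []
    current_title = None
    for element in elements:
        if element["type"] == "TITLE":
            current_title = dict(element, ibids=[])
            result.append(current_title)
        elif element["type"] == "IBID" and current_title is not None:
            # Keep only the IBID's own data; the journal title is implied
            # by the title element it is attached to.
            ibid = {k: v for k, v in element.items() if k != "type"}
            current_title["ibids"].append(ibid)
        else:
            result.append(element)
    return result


tagged = [
    {"type": "TITLE", "title": "Phys. Rev. D", "volume": "10",
     "year": "1974", "page": "123"},
    {"type": "IBID", "volume": "11", "year": "1975", "page": "456"},
    {"type": "MISC", "text": "see also"},
    {"type": "TITLE", "title": "Nucl. Phys. B", "volume": "5",
     "year": "1980", "page": "45"},
]
for element in attach_ibids(tagged):
    print(element)
```

Grouping the data this way means the code that writes the output only has to walk the titles and, for each one, emit the title once per attached IBID, rather than tracking which title an IBID belongs to while generating output.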

Details

Committed

Tibor Simko <tibor.simko@cern.ch>

Nov 23 2011, 00:25

Parents

R3600:3e3964d9beb0: refextract: improvements

Branches

Unknown

Tags

Unknown

Event Timeline

Tibor Simko <tibor.simko@cern.ch> committed R3600:103aef0217c6: refextract: author identification (authored by Christopher Hayward <christopher.james.hayward@cern.ch>). Nov 23 2011, 00:25

Changes (1)

				Path
	M			modules/bibedit/lib/refextract.py
