bibclassify-extraction-algorithm.webdoc
File Metadata

Created: Sat, Jul 12, 07:57

	## -*- mode: html; coding: utf-8; -*-

	## This file is part of Invenio.
	## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010 CERN.
	##
	## Invenio is free software; you can redistribute it and/or
	## modify it under the terms of the GNU General Public License as
	## published by the Free Software Foundation; either version 2 of the
	## License, or (at your option) any later version.
	##
	## Invenio is distributed in the hope that it will be useful, but
	## WITHOUT ANY WARRANTY; without even the implied warranty of
	## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
	## General Public License for more details.
	##
	## You should have received a copy of the GNU General Public License
	## along with Invenio; if not, write to the Free Software Foundation, Inc.,
	## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.

	<!-- WebDoc-Page-Title: The code behind BibClassify: the extraction algorithm -->
	<!-- WebDoc-Page-Navtrail: <a class="navtrail"
	href="<CFG_SITE_URL>/help/hacking">Hacking Invenio</a> &gt;
	<a class="navtrail" href="bibclassify-internals">BibClassify Internals</a> -->
	<!-- WebDoc-Page-Revision: $Id$ -->


	<h2>Contents</h2>

	<strong>1. <a href="#1">Overview</a></strong><br />
	<strong>2. <a href="#2">Taxonomy handling</a></strong><br />
	<strong>3. <a href="#3">Fulltext management</a></strong><br />
	<strong>4. <a href="#4">Single keyword processing</a></strong><br />
	<strong>5. <a href="#5">Composite keyword processing</a></strong><br />
	<strong>6. <a href="#6">Post-processing</a></strong><br />
	<strong>7. <a href="#7">Author keywords</a></strong><br />

	<a name="1"></a><h2>1. Overview</h2>

	<p>This section provides a detailed account of the phrase matching techniques
	used by BibClassify to automatically extract significant terms from fulltext
	documents. While reading this guide, you are advised to refer to the original
	BibClassify code.</p>

	<p>BibClassify extracts two types of keywords from a fulltext document:
	single (or main) keywords and composite keywords. Single keywords consist of
	one or more words ("scalar" or "field theory"). Composite keywords combine
	several single keywords ("field theory: scalar") and are recognized as such
	only if the single keywords are found combined in the fulltext. All keywords
	are stored in an RDF/SKOS taxonomy or in a simple keyword file. When using the
	keyword file, only single keywords can be extracted.</p>

	<p>The bulk of the extraction mechanism takes place inside the functions
	<code>get_single_keywords</code> and <code>get_composite_keywords</code> in
	<code>bibclassify_keyword_analyzer.py</code>.</p>

	<a name="2"></a><h2>2. Taxonomy handling</h2>

	<p>This section explains the code of
	<code>bibclassify_ontology_handler.py</code>.</p>

	<p>BibClassify handles the taxonomy differently depending on whether it is
	running in standalone file mode (from the sources) or as an Invenio module. In
	both cases, the taxonomy is specified through the <code>-k, --taxonomy</code>
	option. In standalone file mode, the argument must be a file path. In module
	mode, the argument refers to the ontology short name found in the clsMETHOD
	table (e.g. "HEP" for the taxonomy "HEP.rdf"). However, the ontology long name
	("HEP.rdf") or even its reference URL also works. The reference URL is stored
	in the "location" column of the clsMETHOD table.</p>

	<p>In standalone file mode, we just compare the date of modification of the
	taxonomy file and the date of creation of the cache file. If the cache is
	older than the ontology, we regenerate it.</p>
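	<p>The freshness test above can be sketched as follows. This is a minimal
	illustration and not the code of <code>bibclassify_ontology_handler.py</code>:
	the function name <code>cache_is_stale</code> and both path arguments are
	hypothetical, and the sketch compares modification times for both files
	because a file's creation time is not available portably.</p>

```python
import os


def cache_is_stale(taxonomy_path, cache_path):
    """Return True when the cache is missing or older than the taxonomy."""
    if not os.path.exists(cache_path):
        return True
    # Modification times are used for both files; the creation time of
    # the cache is not exposed consistently across operating systems.
    return os.path.getmtime(cache_path) < os.path.getmtime(taxonomy_path)
```

	<p>A caller would regenerate the cache whenever this function returns
	<code>True</code>.</p>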

	<p>In module mode, we first check the modification date of the reference
	ontology by performing an HTTP HEAD request. We compare this date with the
	date of the locally stored ontology. If the reference ontology is newer, we
	download it. This ensures that BibClassify always uses the latest ontology
	available. Cache management is the same as in standalone mode.</p>
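	<p>A possible shape for this check, using only the Python standard library,
	is shown below. All function names are hypothetical, and the sketch assumes
	that the server reports the date in a <code>Last-Modified</code> header; when
	the header is absent, it simply keeps the local copy.</p>

```python
import os
import urllib.request
from email.utils import parsedate_to_datetime


def parse_http_date(header):
    """Convert an HTTP date header to a POSIX timestamp (None if absent)."""
    if not header:
        return None
    return parsedate_to_datetime(header).timestamp()


def remote_last_modified(url, timeout=10):
    """Issue a HEAD request and return the Last-Modified timestamp, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        header = response.headers.get("Last-Modified")
    return parse_http_date(header)


def needs_download(remote_timestamp, local_path):
    """Return True when there is no local copy or the remote one is newer."""
    if not os.path.exists(local_path):
        return True
    if remote_timestamp is None:
        # Without a Last-Modified header we cannot tell; keep the local copy.
        return False
    return remote_timestamp > os.path.getmtime(local_path)
```

	<p>A HEAD request transfers only the response headers, so the check stays
	cheap even for a large ontology file.</p>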

	<p>To generate the cache file, the taxonomy is parsed and loaded into memory
	using <a href="http://rdflib.net/">RDFLib</a>.</p>

	<p>The cache consists of dictionaries of SingleKeyword and CompositeKeyword
	objects. These objects contain a meaningful description of the keywords and
	compiled regular expressions that are used to find the keywords in the
	fulltext. These regular expressions are described in section 4.</p>

	<a name="3"></a><h2>3. Fulltext management</h2>

	<p>This section discusses the way BibClassify manages the fulltext of
	records. The source code discussed can be found in
	<code>bibclassify_text_extractor.py</code> and
	<code>bibclassify_text_normalizer.py</code>.</p>

	<p>The code of <code>bibclassify_text_extractor.py</code> will soon be updated
	and therefore the documentation for this module is pending.</p>

	<p>The extraction of PDF documents in the field of HEP can lead to some
	inconsistencies in the document because of mathematical formulas and Greek
	letters. <code>bibclassify_text_normalizer.py</code> takes care of these
	problems by running a set of correcting regular expressions on the extracted
	fulltext. These regular expressions can be found in the configuration file of
	BibClassify.</p>
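	<p>The normalization step can be pictured as an ordered list of pattern and
	replacement pairs applied one after the other. The three rules below are
	invented for illustration and are not the ones shipped in the BibClassify
	configuration file; the names <code>CORRECTIONS</code> and
	<code>normalize_fulltext</code> are likewise hypothetical.</p>

```python
import re

# Hypothetical correction rules: (compiled pattern, replacement).
CORRECTIONS = [
    # Rejoin words hyphenated across a line break: "renormali-\nzation".
    (re.compile(r"(\w)-\n(\w)"), r"\1\2"),
    # Replace the "fi" ligature that PDF extraction often leaves behind.
    (re.compile("\ufb01"), "fi"),
    # Collapse runs of spaces and tabs into a single space.
    (re.compile(r"[ \t]+"), " "),
]


def normalize_fulltext(text, corrections=CORRECTIONS):
    """Apply each correcting regular expression to the text, in order."""
    for pattern, replacement in corrections:
        text = pattern.sub(replacement, text)
    return text
```

	<p>Because the rules run in sequence, their order matters: a later rule sees
	the output of the earlier ones.</p>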

	<a name="4"></a><h2>4. Single keyword processing</h2>

	<p>For each single and composite keyword, the taxonomy contains different labels:
	<ul>
	<li>preferred label: the preferred form of the keyword.</li>
	<li>alternative label: different forms of the keyword that can be found in the
	fulltext.</li>
	<li>hidden label: regular expressions that take care of particular forms, such
	as irregular plurals.</li>
	</ul>
	</p>

	<p>For each of these labels, we compile and cache regular expressions. The way
	the regular expressions are built is described in the configuration file of
	BibClassify.</p>
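	<p>As a rough sketch of this step, the function below compiles one pattern
	per label. It assumes that preferred and alternative labels are literal
	strings to be matched as whole words, and that hidden labels are already
	regular expressions; the actual construction rules are those of the
	BibClassify configuration file. The names
	<code>compile_label_patterns</code> and <code>PATTERN_CACHE</code>, and the
	example labels, are hypothetical.</p>

```python
import re


def compile_label_patterns(pref_label, alt_labels=(), hidden_labels=()):
    """Return compiled, case-insensitive patterns for one keyword."""
    patterns = []
    for label in (pref_label, *alt_labels):
        # Plain labels are escaped and matched as whole words; a space in
        # the label may match any run of whitespace in the text.
        words = [re.escape(word) for word in label.split()]
        expression = r"\b" + r"\s+".join(words) + r"\b"
        patterns.append(re.compile(expression, re.IGNORECASE))
    for expression in hidden_labels:
        # Hidden labels are assumed to already be regular expressions.
        patterns.append(re.compile(expression, re.IGNORECASE))
    return patterns


# A module-level dictionary stands in for the on-disk cache here.
PATTERN_CACHE = {}
PATTERN_CACHE["field theory"] = compile_label_patterns(
    "field theory",
    alt_labels=["field theories"],
    hidden_labels=[r"\bQFTs?\b"],
)
```

	<p>Compiling once and reusing the compiled objects avoids rebuilding the same
	expressions for every document.</p>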

	<p>When searching for single keywords in a fulltext, we run the corresponding
	set of regular expressions on the text and store the number of matches and
	the position of the keywords in the text.</p>
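	<p>The search itself can be sketched as follows. The function name
	<code>find_keyword</code> and the sample patterns and text are hypothetical;
	positions are recorded as (start, end) character offsets, which is one
	possible representation among others.</p>

```python
import re


def find_keyword(patterns, fulltext):
    """Return (number of matches, sorted list of (start, end) spans)."""
    spans = set()
    for pattern in patterns:
        for match in pattern.finditer(fulltext):
            # A set avoids counting the same span twice when two labels
            # of the same keyword match identical text.
            spans.add(match.span())
    ordered = sorted(spans)
    return len(ordered), ordered


patterns = [
    re.compile(r"\bfield theory\b", re.IGNORECASE),
    re.compile(r"\bfield theories\b", re.IGNORECASE),
]
text = "Scalar field theory is the simplest of all field theories."
count, positions = find_keyword(patterns, text)
```

	<p>Keeping the positions, and not only the count, is what makes the
	composite keyword check of the next section possible.</p>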

	<a name="5"></a><h2>5. Composite keyword processing</h2>

	<p>For each composite keyword, we first run the regular expressions
	corresponding to alternative and hidden labels. This is similar to the search
	for single keywords.</p>

	<p>Then, for each composite keyword, we check whether all of its components
	were found in the fulltext. If so, we compare the positions of the single
	keywords in the text. If the single keywords are next to each other, we have
	found a composite keyword. If not, we check whether the words between the
	single keywords are valid separators (defined in the BibClassify
	configuration file); if they are, the composite keyword is counted as well.</p>
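	<p>The position check can be illustrated for a composite keyword with two
	components. This is a simplified sketch under several assumptions: the
	separator list is invented, components may appear in either order, and any
	punctuation between them rules out a match. The names
	<code>VALID_SEPARATORS</code>, <code>gap_is_acceptable</code> and
	<code>count_composite</code> are hypothetical.</p>

```python
import re

# Hypothetical separator words; the real list lives in the configuration.
VALID_SEPARATORS = {"of", "in", "the", "a", "for", "and"}


def gap_is_acceptable(gap_text, separators=VALID_SEPARATORS):
    """True if the gap is empty or holds only valid separator words."""
    if re.search(r"[^\w\s]", gap_text):
        # Punctuation between the components breaks the combination.
        return False
    return all(word.lower() in separators for word in gap_text.split())


def count_composite(first_spans, second_spans, fulltext):
    """Count places where the two components occur side by side."""
    occurrences = 0
    for a_span in first_spans:
        for b_span in second_spans:
            # Order the two spans so the gap is always read left to right.
            (_, left_end), (right_start, _) = sorted([a_span, b_span])
            if left_end > right_start:
                continue  # overlapping matches are not a combination
            if gap_is_acceptable(fulltext[left_end:right_start]):
                occurrences += 1
    return occurrences
```

	<p>The spans passed in are the (start, end) offsets recorded during the
	single keyword search.</p>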

	<p>The result of this operation is a list of composite keywords with the total
	number of occurrences. Occurrences for all concerned single keywords are also
	attached to this list.</p>

	<a name="6"></a><h2>6. Post-processing</h2>

	<p>Before presenting the results to the user, some extra filtering occurs,
	primarily to refine the output keywords. The main post-processing actions
	performed on the results are:
	<ul>
	<li>Remove single keywords that are part of a composite keyword found in the
	document.</li>
	<li>Remove single keywords that are marked as non-standalone.</li>
	<li>Produce the desired output: standard or SPIRES.</li>
	</ul>
	</p>

	<p>The final results presented to the user consist of the 20 best single
	keywords and the 20 best composite keywords (the number is configurable). The
	results may be presented in different formats (text output or MARCXML). Sample
	text output can be found in the
	<a href="bibclassify-admin-guide">BibClassify Admin Guide</a>.
	</p>
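	<p>The two filtering steps and the final cut can be sketched together. The
	function name <code>post_process</code> and its data layout are hypothetical,
	and the sketch assumes that "best" means "most frequent", which this guide
	does not state explicitly. Output formatting (standard or SPIRES) is left
	out.</p>

```python
def post_process(single_counts, composite_counts, components,
                 non_standalone, limit=20):
    """Filter single keywords and keep the `limit` best of each kind.

    single_counts: {single keyword: number of occurrences}
    composite_counts: {composite keyword: number of occurrences}
    components: {composite keyword: list of its single keywords}
    non_standalone: set of single keywords not allowed on their own
    """
    # Single keywords that already appear inside a found composite keyword.
    used = {
        single
        for composite in composite_counts
        for single in components[composite]
    }
    kept = {
        keyword: count
        for keyword, count in single_counts.items()
        if keyword not in used and keyword not in non_standalone
    }

    def best(counts):
        # Sort by descending count, then alphabetically for stable output.
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        return ranked[:limit]

    return best(kept), best(composite_counts)
```

	<p>The function returns two ranked lists of (keyword, count) pairs, one for
	single keywords and one for composite keywords.</p>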

	<a name="7"></a><h2>7. Author keywords</h2>

	<p>BibClassify also extracts author keywords when the option
	<code>--detect-author-keywords</code> is set. BibClassify searches the
	fulltext for the string of keywords supplied by the author, then splits it
	into individual keywords and outputs them.</p>
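	<p>One way to picture this step is shown below. The sketch assumes that the
	author keywords sit on a single line introduced by "Keywords" or "Key words"
	and are separated by commas or semicolons; the detection rules used by
	BibClassify may differ. The names <code>KEYWORD_LINE</code> and
	<code>detect_author_keywords</code> are hypothetical.</p>

```python
import re

# Assumed layout: a line starting with "Keywords" or "Key words",
# followed by a colon or dash and a delimited list.
KEYWORD_LINE = re.compile(
    r"^\s*key\s?words\s*[:\-]\s*(.+)$",
    re.IGNORECASE | re.MULTILINE,
)


def detect_author_keywords(fulltext):
    """Return the author-supplied keywords, or an empty list if none found."""
    match = KEYWORD_LINE.search(fulltext)
    if match is None:
        return []
    parts = re.split(r"[;,]", match.group(1))
    # Drop surrounding spaces and a trailing full stop from each keyword.
    return [part.strip(" .") for part in parts if part.strip(" .")]
```

	<p>Only the first matching line is used; a document without such a line
	yields an empty list.</p>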
