Page MenuHomec4science

bibclassify-extraction-algorithm.webdoc
No OneTemporary

File Metadata

Created
Wed, Oct 16, 02:57

bibclassify-extraction-algorithm.webdoc

## -*- mode: html; coding: utf-8; -*-
## This file is part of Invenio.
## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010 CERN.
##
## Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
<!-- WebDoc-Page-Title: The code behind BibClassify: the extraction algorithm -->
<!-- WebDoc-Page-Navtrail: <a class="navtrail"
href="<CFG_SITE_URL>/help/hacking">Hacking Invenio</a> &gt;
<a class="navtrail" href="bibclassify-internals"/>BibClassify Internals</a> -->
<!-- WebDoc-Page-Revision: $Id$ -->
<h2>Contents</h2>
<strong>1. <a href="#1">Overview</a></strong><br />
<strong>2. <a href="#2">Taxonomy handling</a></strong><br />
<strong>3. <a href="#3">Fulltext management</a></strong><br />
<strong>4. <a href="#4">Single Keyword processing</a></strong><br />
<strong>5. <a href="#5">Composite keyword processing</a></strong><br />
<strong>6. <a href="#6">Post-processing</a></strong><br />
<strong>7. <a href="#7">Author keywords</a></strong><br />
<a name="1"></a><h2>1. Overview</h2>
<p>This section provides a detailed account of the phrase matching techniques
used by BibClassify to automatically extract significant terms from fulltext
documents. While reading this guide, you are advised to refer to the original
BibClassify code.</p>
<p>BibClassify extracts 2 types of keywords from a fulltext documents:
single/main keywords and composite keywords. Single keywords are keywords
composed of one or more words ("scalar" or "field theory"). Composite keywords
are composed of several single keywords ("field theory: scalar") and are
considered as such if the single keywords are found combined in the fulltext.
All keywords are stored in a RDF/SKOS taxonomy or in a simple keyword file.
When using the keyword file, it is only possible to extract single keywords.</p>
<p>The bulk of the extraction mechanism takes place inside the functions
<code>get_single_keywords</code> and <code>get_composite_keywords</code> in
<code>bibclassify_keyword_analyzer.py</code>.</p>
<a name="2"></a><h2>2. Taxonomy handling</h2>
<p>This paragraph explains the code of
<code>bibclassify_ontology_handler.py</code>.</p>
<p>BibClassify handles the taxonomy differently whether it is running in
standalone file mode (from the sources) or as an Invenio module. In both
cases, the taxonomy is specified through the <code>-k, --taxonomy</code>
option. In standalone file mode, the argument has to be a path when in normal
mode. In module mode, the argument refers to the ontology short name found in
the clsMETHOD table (e.g. "HEP" for the taxonomy "HEP.rdf"). However the
ontology long name ("HEP.rdf") or even its reference URL do also work. The
reference URL is stored in the table clsMETHOD in the column "location".</p>
<p>In standalone file mode, we just compare the date of modification of the
taxonomy file and the date of creation of the cache file. If the cache is
older than the ontology, we regenerate it.</p>
<p>In module mode, we first check the modification date of the reference
ontology by performing a HTTP HEAD request. We compare this date with the date
of the locally stored ontology. If needed we download the newer ontology. This
ensures that BibClassify always uses the latest ontology available. The cache
management is similar to the standalone mode.</p>
<p>In order to generate the cache file, the taxonomy is stored and parsed into
memory using <a href="http://rdflib.net/">RDFLib</a>.</p>
<p>The cache consists of dictionaries of SingleKeyword and CompositeKeyword
objects. These objects contain a meaningful description of the keywords and
regular expressions in a compiled form that allow to find the keywords in the
fulltext. These regular expressions are described in paragraph 4.</p>
<a name="3"></a><h2>3. Fulltext management</h2>
<p>This paragraph discusses the way BibClassify manages the fulltext of
records. Source code discussed can be found in
<code>bibclassify_text_extractor</code> and
<code>bibclassify_text_normalizer</code>.</p>
<p>The code of <code>bibclassify_text_extractor.py</code> will soon be updated
and therefore the documentation for this module is pending.</p>
<p>The extraction of PDF documents in the field of HEP can lead to some
inconsistencies in the document because of mathematical formulas and Greek
letters. <code>bibclassify_text_normalizer.py</code> takes care of these
problems by running a set of correcting regular expressions on the extracted
fulltext. These regular expressions can be found in the configuration file of
BibClassify.</p>
<a name="4"></a><h2>4. Single Keyword processing</h2>
<p>For each single and composite keyword, the taxonomy contains different labels:
<ul>
<li>preferred label: the preferred form of the keyword.</li>
<li>alternative label: different forms of the keyword that can be found in the
fulltext.</li>
<li>hidden label: regular expressions that take care of particular forms such as
weird plurals e.g.</li>
</ul>
</p>
<p>For each of these labels, we compile and cache regular expressions. The way
the regular expressions are built is described in the configuration file of
BibClassify.</p>
<p>When searching for single keywords in a fulltext, we run the corresponding
set of regular expressions on the text and store the number of matches and
the position of the keywords in the text.</p>
<a name="5"></a><h2>5. Composite keyword processing</h2>
<p>For each composite keyword, we first run the regular expressions
corresponding to alternative and hidden labels. This is similar to the search
for single keywords.</p>
<p>Then, for each composite keyword, we check if all of its components were
found in the fulltext. If this is the case, then we check the positions of the
single keywords in the text. If the single keywords are placed nearby, then we
found a composite keyword. If not, then we check if the words placed between
the single keywords are valid separators (configured in the configuration file
of BibClassify).</p>
<p>The result of this operation is a list of composite keywords with the total
number of occurrences. Occurrences for all concerned single keywords are also
attached to this list.</p>
<a name="6"></a><h2>6. Post-processing</h2>
<p>Before presenting the results to the user, some extra filtering occurs,
primarily to refine the output keywords. The main post-processing actions
performed on the results are:
<ul>
<li>Remove single keywords that are part of a composite keyword found in the
document.</li>
<li>Remove single keywords that are noted as non standalone.</li>
<li>Produce the desired output: standard or SPIRES.</li>
</ul>
</p>
<p>The final results that are produced to the user consist of the 20 first
(configurable) best single keywords and best composite keywords. The results
may be presented in different formats (text output or MARCXML). Sample text
output can be found in the
<a href="bibclassify-admin-guide">BibClassify Admin Guide</a>.
</p>
<a name="7"></a><h2>7. Author keywords</h2>
<p>BibClassify extracts also author keywords when the option
'--detect-author-keywords' is set. BibClassify searches for the string of
keywords in the fulltext. Then it separates them and outputs them.</p>

Event Timeline