<p> This section provides a detailed account of the phrase matching
techniques used by BibClassify to automatically extract significant terms
from fulltext documents. While reading this guide, you are advised to refer to the original
BibClassify code, mostly contained in <code>lib/python/invenio/bibclassifylib.py</code>.
This guide refers to version 2006/09/15.
<p>The bulk of the extraction mechanism takes place inside the function
<code>generate_keywords_rdf</code>. This function is triggered when
BibClassify is launched with parameter <code>-K,
--taxonomy=FILENAME</code>. Let's have a look at what happens inside this function, step-by-step.
<p><span class="adminbox"> <b>NB.</b> BibClassify can also run on top of simple text thesauri
(parameter <code>-k, --thesaurus=FILENAME</code>, function
<code>generate_keywords</code>); however, this mode is now deprecated
and no longer maintained.</span>
<a name="2"></a><h2>2. Preprocessing</h2>
<p> At the beginning of the function, various local variables are
declared. Among these,
<ul>
<li><code>namespace</code>: This variable
points to the main rdf:Namespace used in the taxonomy. In the case of
RDF SKOS, this is
<code>http://www.w3.org/2004/02/skos/core#</code>. If you need to use namespaces other than this one, please modify this variable
accordingly.</li>
<li><code>delimiter</code>: This variable represents the delimiter
symbol used to separate the components of composite keywords. For example, the current HEP
taxonomy adopts ":" (e.g. <em>baryon: asymmetry</em>) whereas the SPIRES standard
adopts "," (e.g. <em>baryon, asymmetry</em>). If you intend to set up and use
composite keywords in your taxonomy, please set this variable to the desired delimiter.</li>
</ul>
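<p>As a minimal sketch of how these two settings might be used, the following snippet builds a full SKOS URI from the namespace and joins the components of a composite keyword with the chosen delimiter. The names <code>SKOS_NAMESPACE</code>, <code>skos_uri</code> and <code>format_composite</code> are hypothetical and do not appear in BibClassify.</p>

```python
# Assumed stand-ins for the "namespace" and "delimiter" local variables.
SKOS_NAMESPACE = "http://www.w3.org/2004/02/skos/core#"


def skos_uri(term, namespace=SKOS_NAMESPACE):
    """Return the full URI of a term (e.g. prefLabel) inside the namespace."""
    return namespace + term


def format_composite(components, delimiter=":"):
    """Join the components of a composite keyword with the chosen delimiter."""
    return (delimiter + " ").join(components)


print(skos_uri("prefLabel"))
print(format_composite(["baryon", "asymmetry"]))                  # HEP style
print(format_composite(["baryon", "asymmetry"], delimiter=","))   # SPIRES style
```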
<p>The taxonomy (<code>dictfile</code>) is stored and parsed into memory via <a href="http://rdflib.net/">RDFlib</a> by the following two lines of code:
<blockquote>
<code>
store = rdflib.Graph()<br>
store.parse(dictfile)
</code>
</blockquote>
<p><span class="adminbox"> <b>NB.</b> RDFLib provides very handy
libraries for RDF manipulation. However, when dealing with large RDF
files, loading and parsing are by far the principal
factor affecting the performance of BibClassify. For example, when
loading the <a href="http://cdswebdev.cern.ch/bibclassify/HEP.rdf">HEP taxonomy</a> (7.4 MB, 16000 Concepts) on an Intel(R)
Xeon(TM) CPU 3.06GHz, the two lines of code above take a total of <b>26
seconds</b> to complete, which is over two thirds of the total execution time of 36 seconds.
If performance is your main concern, consider a faster library or
working on a pre-loaded RDF store.
</span>
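<p>One possible way to approximate a pre-loaded store is to cache the parsed taxonomy on disk and reparse only when the source file changes. The sketch below is not part of BibClassify: <code>load_cached</code> is a hypothetical helper, the parsing step is passed in as a function, and it assumes that the parsed object can be pickled and that unpickling is faster than parsing, both of which should be verified for your RDF library.</p>

```python
import os
import pickle


def load_cached(source_path, cache_path, parse):
    """Return parse(source_path), reusing a pickled copy while it is fresh."""
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(source_path)):
        # The cache is at least as recent as the taxonomy: skip the parsing.
        with open(cache_path, "rb") as cache_file:
            return pickle.load(cache_file)
    parsed = parse(source_path)
    with open(cache_path, "wb") as cache_file:
        pickle.dump(parsed, cache_file)
    return parsed
```

<p>Here <code>parse</code> would wrap the two lines that create and fill the store, and <code>cache_path</code> is any writable location chosen by the administrator.</p>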
<p>At this point, the fulltext of the document is converted from PDF
to text (using the standard Linux command <code>pdftotext</code>) and stored in a
string <code>text_string</code>. This string will contain the full
document if running in slow mode or an arbitrary excerpt of the
document (about 40%) if running in fast mode. Moreover, the very
beginning of the string (10%) is stored in a variable called
<code>abstract</code>. This is done based on the assumption that
manuscripts generally contain crucial information such as title and
abstract in the very first portion of the document. This portion can
then be treated as more relevant than the remainder of the document. Please
bear this in mind if running BibClassify on documents with different
structures or when running on heterogeneous collections.
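<p>The slicing described above can be sketched as follows. This is an illustration only: <code>select_text</code> and its parameters are hypothetical names, and the sketch assumes that the fast-mode excerpt is simply the leading 40% of the document, which the text above does not specify.</p>

```python
def select_text(fulltext, fast=False, fast_percent=40, abstract_percent=10):
    """Return (text_string, abstract) for an already extracted fulltext."""
    if fast:
        # Assumption: the excerpt is the leading part of the document.
        text_string = fulltext[:len(fulltext) * fast_percent // 100]
    else:
        text_string = fulltext
    # The very beginning of the string is kept apart as the "abstract".
    abstract = text_string[:len(text_string) * abstract_percent // 100]
    return text_string, abstract
```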
<p> In many manuscripts, the author includes a list of pre-assigned
terms that describe the topic of the article. The last step before
keyword extraction begins is to locate these author-assigned
keywords. We try to isolate them by searching for the key phrase
<em>Keywords</em> followed by a list of terms. When found, the string
is stored in the variable <code>safe_keys</code> and used later to match
BibClassify output against author-assigned keywords (these are marked
in the output with an asterisk, e.g. <code>13* Hubble constant</code>).
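<p>A rough sketch of this step is given below. The regular expression, the function names and the use of a list (rather than a single string such as <code>safe_keys</code>) are assumptions made for illustration and are not taken from the BibClassify code; only the asterisk marker follows the output format shown above.</p>

```python
import re

# Assumed pattern (not taken from BibClassify): the word "Keywords", an
# optional ":" or ".", then the rest of the line as the list of terms.
KEYWORDS_LINE = re.compile(r"^\s*key\s?words?\s*[:.]?\s*(.+)$",
                           re.IGNORECASE | re.MULTILINE)


def find_author_keywords(text):
    """Return the author-assigned keywords found in text, lower-cased."""
    match = KEYWORDS_LINE.search(text)
    if match is None:
        return []
    terms = re.split(r"[;,]", match.group(1))
    return [term.strip(" .").lower() for term in terms if term.strip(" .")]


def mark_author_keywords(results, author_keywords):
    """Format (count, keyword) pairs, adding "*" to author-assigned ones."""
    known = set(author_keywords)
    lines = []
    for count, keyword in results:
        marker = "*" if keyword.lower() in known else ""
        lines.append("%d%s %s" % (count, marker, keyword))
    return lines
```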
<a name="3"></a><h2>3. Single Keyword (mkw) processing</h2>
<p> The bulk of the phrase matching operations, that is, the extraction of the single
keywords from the fulltext, is contained in a big <code>for</code> loop. In this
loop, every RDF <code>Concept</code> is parsed, one at a time, and its
components (such as <code>prefLabel</code> and <code>altLabel</code>) are matched against the document.
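<p>The overall shape of such a loop can be sketched with a plain dictionary standing in for the RDF Concepts. This is a simplified illustration under stated assumptions: <code>count_single_keywords</code> is a hypothetical name, each Concept is reduced to a prefLabel mapped to its altLabels, matching is a case-insensitive whole-phrase search, and the result is assumed to be a list of (count, keyword) pairs. The special cases handled by the real code are not reproduced.</p>

```python
import re


def count_single_keywords(concepts, text_string):
    """Count label occurrences; concepts maps a prefLabel to its altLabels."""
    keylist = []
    for pref_label, alt_labels in concepts.items():
        total = 0
        for label in [pref_label] + list(alt_labels):
            # Escape the label so that it is matched literally, as a phrase.
            pattern = re.compile(r"\b" + re.escape(label) + r"\b",
                                 re.IGNORECASE)
            total += len(pattern.findall(text_string))
        if total:
            keylist.append((total, pref_label))
    # Most frequent keywords first.
    keylist.sort(reverse=True)
    return keylist
```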
<p><span class="adminbox"> <b>NB.</b> For a detailed explanation
<p><span class="adminbox"> <b>NB.</b> One could add at this point
the possibility of having combinations of more than two mkws, e.g.
<em>ckwURI : [mkw1URI, mkw2URI, mkw3URI]</em>. This is feasible, but
was not implemented at this stage, because of the performance overhead that would
be generated by the phrase matching of more complex regular expressions.</span>
<p>Once the <code>composites</code> dictionary is completed, its keys
that point to lists of two values (like the example above) are ckw
candidates. We now need to check whether they actually appear one next
to the other in the text. This is done in function
<code>makeCompPattern</code> by compiling a pair of regular
expressions: one for <em>mkw1</em> followed by <em>mkw2</em> and one
for the inverse situation, <em>mkw2</em> followed by <em>mkw1</em>. Once again,
when compiling the regex pattern we have to take extra care to treat
special cases (hyphens, short names, wildcards) accordingly. The sum
of the incidence of the two patterns in the <code>text_string</code> is
stored in a list (<code>compositesOUT</code>) that is then sorted (by
occurrence) and output.
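<p>The pair of patterns described above can be sketched as follows. This is not the code of <code>makeCompPattern</code>: the helper names are hypothetical, the dictionary values are assumed to be keyword labels rather than URIs, the two keywords are assumed to be separated by whitespace only, and the special cases (hyphens, short names, wildcards) are left out.</p>

```python
import re


def make_pair_patterns(mkw1, mkw2):
    """Compile one pattern for "mkw1 mkw2" and one for "mkw2 mkw1"."""
    first, second = re.escape(mkw1), re.escape(mkw2)
    forward = re.compile(r"\b" + first + r"\s+" + second + r"\b", re.IGNORECASE)
    backward = re.compile(r"\b" + second + r"\s+" + first + r"\b", re.IGNORECASE)
    return forward, backward


def count_composites(composites, text_string):
    """Return (count, ckw) pairs sorted by occurrence, most frequent first."""
    composites_out = []
    for ckw, components in composites.items():
        if len(components) != 2:
            continue  # only pairs of single keywords are ckw candidates
        forward, backward = make_pair_patterns(*components)
        total = (len(forward.findall(text_string))
                 + len(backward.findall(text_string)))
        if total:
            composites_out.append((total, ckw))
    composites_out.sort(reverse=True)
    return composites_out
```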
<a name="5"></a><h2>5. Postprocessing</h2>
<p> Before presenting the results to the user, some extra filtering
occurs, primarily to refine the output keywords. The main postprocessing actions performed on the results are:
<ul>
<li> Ensure that the order of the occurrence counter ([n1,n2]) for composite keywords (e.g. <em>baryon: asymmetry [7, 12]</em>) is correct.</li>
<li> Ensure that "stray" wildcard labels (e.g. hiddenLabels of composite keywords) do not cause double phrase matching.</li>
<li> Produce the desired output using the chosen ckw delimiter: the BibClassify standard (:) or the SPIRES one (,).</li>
<li> Filter out single keywords that are contained in one another, e.g. if <em>magnetic</em> and <em>magnetic field</em> appear among mkws, subtract the occurrence of <em>magnetic</em> from <em>magnetic field</em>.
<p><span class="adminbox"> <b>NB.</b> One could also perform this
last post-processing step at the composite keyword level, e.g. <em>energy:
density</em> to be overridden by <em>dark energy: density</em>. This has
not yet been implemented for safety reasons (the incidence of altLabels
on the ckw computation).</span></li></ul>
<p> The final results presented to the user consist of the
first <code>n</code> entries of <code>keylist</code> (the single
keywords) and the entries in <code>compositesOUT</code> (the composite
keywords). The results may be presented in text or HTML format,
according to the output mode chosen at the command line. Sample output (both text and html) can be found in the