bibclassify-hep-taxonomy.webdoc
File Metadata

Created
Wed, Oct 2, 21:14

## $Id$
## This file is part of CDS Invenio.
## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007 CERN.
##
## CDS Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## CDS Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with CDS Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
<!-- WebDoc-Page-Title: The HEP taxonomy: rationale and extensions -->
<!-- WebDoc-Page-Navbar-Name: hacking-bibclassify -->
<!-- WebDoc-Page-Navtrail: <a class="navtrail" href="<WEBURL>/help/hacking">Hacking CDS Invenio</a> &gt; <a class="navtrail" href="bibclassify-internals">BibClassify Internals</a> -->
<!-- WebDoc-Page-Navbar-Select: hacking-bibclassify-index -->
<!-- WebDoc-Page-Revision: $Id$-->
<p> The DESY Library has been responsible for maintaining a thesaurus
of high energy physics (HEP) terms for a long time. The thesaurus is
currently used as a subject controlled vocabulary by HEP institutes
worldwide. The need to convert the HEP <a
href="http://www-library.desy.de/schlagw2.html">text
thesaurus</a> to a more complex <a href="http://cdswebdev.cern.ch/bibclassify/HEP.rdf">taxonomy</a>, with richer structure
and semantics, was mainly driven by the needs of the BibClassify system.
The current taxonomy of high energy physics is expressed in the <a href="http://www.w3.org/2004/02/skos/">SKOS</a>
syntax. SKOS is a dialect of RDF and it is especially intended for the
representation of knowledge organization systems, such as thesauri,
taxonomies and basic ontologies.
<p><span class="adminbox">&nbsp;<b>NB.</b> SKOS was adopted over other similar
knowledge organization formats, notably <a href="http://www.w3.org/TR/owl-features/">OWL</a>, because it is simple yet complete.
The SKOS language contains all the basic properties needed to express the taxonomy, and it allows straightforward conversion from text to RDF.
For a more detailed discussion of this matter, see the <a
href="https://savannah.cern.ch/cookbook/?func=detailitem&item_id=112">CERN-DESY email correspondence</a>.</span>
</p>
<p> In order to satisfy the needs of typical HEP classification
schemes and practices, the SKOS language had to be extended to include
additional properties. HEP keywords are often expressed as a
combination - a pair - of keywords. An example of this is:
<blockquote> <code>Born-Infeld model: monopole</code> </blockquote>
Here both <code>Born-Infeld model</code> and <code>monopole</code> are
standard HEP keywords (<em>single keywords</em>). However, they also
combine to express a collective concept. We call this a
<em>composite keyword</em> and have extended the SKOS language to
accommodate it. The two property extensions
created for this purpose are:</p>
<ul>
<li> <code>composite</code>:
expresses a relationship of <em>combination</em> with another
single keyword by pointing to the resulting composite keyword. It is a sub-property of the SKOS property <code>narrower</code>.</li>
<li> <code>compositeOf</code>:
expresses the dependence of a composite keyword on its two single keywords.
It is a sub-property of the SKOS property <code>broader</code>.</li>
</ul>
<p>
These two extensions are probably best described by an example.
The single keyword <code>Born-Infeld model</code> is expressed in the HEP taxonomy as:
<blockquote><pre>
&lt;Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#Born-Infeldmodel"&gt;
&lt;prefLabel xml:lang="en"&gt;Born-Infeld model&lt;/prefLabel&gt;
&lt;hiddenLabel xml:lang="en"&gt;Born-Infeld&lt;/hiddenLabel&gt;
&lt;altLabel xml:lang="en"&gt;DBI&lt;/altLabel&gt;
&lt;broader rdf:resource="http://cern.ch/thesauri/HEP.rdf#fieldtheoreticalmodel"/&gt;
&lt;composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelrelativistic"/&gt;
&lt;composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelnonlinear"/&gt;
&lt;composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelnonabelian"/&gt;
&lt;composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelmonopole"/&gt;
&lt;composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelchiral"/&gt;
&lt;/Concept&gt;
</pre></blockquote>
The concept contains all the usual SKOS tags to express the relations
and denominations of a concept (<code>prefLabel</code>,
<code>broader</code>, etc.). In addition, it contains five
<code>composite</code> tags: these link to five different combinations
of this keyword with other single keywords. For example, one of these points to the
composite keyword <code>Born-Infeld model: monopole</code>, whose
entry is:
<blockquote><pre>
&lt;Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelmonopole"&gt;
&lt;prefLabel xml:lang="en"&gt;Born-Infeld model: monopole&lt;/prefLabel&gt;
&lt;compositeOf rdf:resource="http://cern.ch/thesauri/HEP.rdf#Born-Infeldmodel"/&gt;
&lt;compositeOf rdf:resource="http://cern.ch/thesauri/HEP.rdf#monopole"/&gt;
&lt;/Concept&gt;
</pre></blockquote>
</p>
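<p> The links between the two entries above can be followed programmatically.
The following is a minimal sketch, not BibClassify's own code: it assumes that the
concepts are wrapped in an <code>rdf:RDF</code> element whose default namespace is
SKOS core (the real <code>HEP.rdf</code> header may differ), and it adds an assumed
entry for <code>monopole</code>, which is not shown above. The helper names
<code>local_name</code> and <code>load_concepts</code> are hypothetical.
</p>

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BASE = "http://cern.ch/thesauri/HEP.rdf#"

# Assumption: the rdf:RDF wrapper, the default namespace and the "monopole"
# entry are illustrative; only the other two concepts come from the text.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://www.w3.org/2004/02/skos/core#">
  <Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#Born-Infeldmodel">
    <prefLabel xml:lang="en">Born-Infeld model</prefLabel>
    <composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelmonopole"/>
  </Concept>
  <Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#monopole">
    <prefLabel xml:lang="en">monopole</prefLabel>
    <composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelmonopole"/>
  </Concept>
  <Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelmonopole">
    <prefLabel xml:lang="en">Born-Infeld model: monopole</prefLabel>
    <compositeOf rdf:resource="http://cern.ch/thesauri/HEP.rdf#Born-Infeldmodel"/>
    <compositeOf rdf:resource="http://cern.ch/thesauri/HEP.rdf#monopole"/>
  </Concept>
</rdf:RDF>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def load_concepts(xml_text):
    """Map each concept URI to its preferred label and composite links."""
    concepts = {}
    for node in ET.fromstring(xml_text):
        if local_name(node.tag) != "Concept":
            continue
        entry = {"prefLabel": None, "composite": [], "compositeOf": []}
        for child in node:
            name = local_name(child.tag)
            if name == "prefLabel":
                entry["prefLabel"] = child.text
            elif name in ("composite", "compositeOf"):
                entry[name].append(child.get("{%s}resource" % RDF_NS))
        concepts[node.get("{%s}about" % RDF_NS)] = entry
    return concepts


concepts = load_concepts(SAMPLE)
composite_uri = BASE + "Composite.Born-Infeldmodelmonopole"

# Follow compositeOf from the composite keyword back to its two single keywords.
components = [concepts[uri]["prefLabel"]
              for uri in concepts[composite_uri]["compositeOf"]]
print(components)  # ['Born-Infeld model', 'monopole']
```

<p> Matching on local tag names keeps the sketch independent of the namespace in
which <code>composite</code> and <code>compositeOf</code> are actually declared.
</p>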
<p> The example shows how single and composite keywords are structured
and how the properties <code>composite</code> and
<code>compositeOf</code> associate them. This model
lets us efficiently extract keyword pairs from fulltext, as
explained in the <a href="bibclassify-extraction-algorithm">BibClassify extraction guide</a>.
</p>
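<p> To illustrate why the model helps with keyword pairs, here is a deliberately
simplified sketch. It is not the algorithm described in the extraction guide: it
merely assumes that a composite keyword becomes a candidate when both of its single
keywords occur somewhere in the text. The <code>COMPOSITE_OF</code> table, the
component label <code>chiral</code> and the function names are hypothetical.
</p>

```python
import re

# Hypothetical miniature index: composite label -> its two single keywords,
# as one would obtain by following the compositeOf properties.
COMPOSITE_OF = {
    "Born-Infeld model: monopole": ("Born-Infeld model", "monopole"),
    "Born-Infeld model: chiral": ("Born-Infeld model", "chiral"),
}


def occurs(keyword, text):
    """Return True if the keyword appears in the text as a whole word or phrase."""
    pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def composite_candidates(text):
    """List composite keywords whose two single keywords both occur in the text."""
    singles = {kw for pair in COMPOSITE_OF.values() for kw in pair}
    found = {kw for kw in singles if occurs(kw, text)}
    return sorted(label for label, pair in COMPOSITE_OF.items()
                  if all(kw in found for kw in pair))


text = "We study a monopole solution of the Born-Infeld model."
candidates = composite_candidates(text)
print(candidates)  # ['Born-Infeld model: monopole']
```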
<p> Finally, it is worth pointing out a couple of other syntax
practices that might be specific only to the HEP taxonomy:
</p>
<ul>
<li> <code>hiddenLabel</code>: this lexical label (currently in an unstable
state) is used primarily to define alternative displays of a keyword
that are meant to be hidden from the user, but are still accessible to
text parsing operations. In the HEP taxonomy, these include
misspellings and wildcards (strictly expressed as legal regular
expressions).</li>
<li> <code>note</code>: this label is currently
reserved to describe a <code>nostandalone</code> condition - a
property of those single keywords that can only appear as part of
composite keywords (their occurrence as standalones is discarded).
</li></ul>
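<p> The two practices above can be sketched as follows. This is an illustration
under stated assumptions rather than BibClassify's implementation: the keyword
records, the <code>chiral</code> entry with its wildcard pattern, and the boolean
<code>nostandalone</code> field (standing in for the content of a <code>note</code>
tag) are all hypothetical.
</p>

```python
import re

# Hypothetical keyword records. "hidden" holds hiddenLabel values, which are
# treated as regular expressions; "nostandalone" stands in for the note tag.
KEYWORDS = [
    {"prefLabel": "Born-Infeld model", "hidden": [r"Born-Infeld"],
     "nostandalone": False},
    {"prefLabel": "chiral", "hidden": [r"chiral\w*"],
     "nostandalone": True},
]


def standalone_matches(text):
    """Return preferred labels matched in the text, minus nostandalone keywords."""
    hits = []
    for kw in KEYWORDS:
        # The preferred label is matched literally, hidden labels as regexes.
        patterns = [re.escape(kw["prefLabel"])] + kw["hidden"]
        matched = any(re.search(p, text, re.IGNORECASE) for p in patterns)
        if matched and not kw["nostandalone"]:
            hits.append(kw["prefLabel"])
    return hits


matches = standalone_matches("A chirally symmetric Born-Infeld action")
print(matches)  # ['Born-Infeld model']
```

<p> Here the hidden label <code>Born-Infeld</code> matches even though the preferred
label does not occur in the text, while the match on <code>chirally</code> is
discarded because that keyword is marked as <code>nostandalone</code>.
</p>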
