Page MenuHomec4science

bibindex-admin-guide.webdoc
No OneTemporary

File Metadata

Created
Sat, May 18, 03:26

bibindex-admin-guide.webdoc

## -*- mode: html; coding: utf-8; -*-
## $Id$
## This file is part of CDS Invenio.
## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 CERN.
##
## CDS Invenio is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## CDS Invenio is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with CDS Invenio; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
<!-- WebDoc-Page-Title: BibIndex Admin Guide -->
<!-- WebDoc-Page-Navtrail: <a class="navtrail" href="<WEBURL>/help/admin<lang:link/>">_(Admin Area)_</a> -->
<!-- WebDoc-Page-Revision: $Id$ -->
<p><table class="errorbox">
<thead>
<tr>
<th class="errorboxheader">
WARNING: BIBINDEX ADMIN GUIDE IS UNDER DEVELOPMENT
</th>
</tr>
</thead>
<tbody>
<tr>
<td class="errorboxbody">
BibIndex Admin Guide is not yet completed. Most of admin-level
functionality for BibIndex exists only in commandline mode. We
are in the process of developing both the guide as well as the
web admin interface. If you are interested in seeing some
specific things implemented with high priority, please contact us
at <CFG_SITE_SUPPORT_EMAIL>. Thanks for your interest!
</td>
</tr>
</tbody>
</table></p>
<h2>Contents</h2>
<strong>1.<a href="#1">Overview</a></strong><br/>
<strong>2. <a href="#2">Configure Metadata Tags and Fields</a></strong><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.1 <a href="#2.1">Configure Physical MARC Tags</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.2 <a href="#2.2">Configure Logical Fields</a><br/>
<strong>3. <a href="#3">Configure Word/Phrase Indexes</a></strong><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.1 <a href="#3.1">Define New Index</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.2 <a href="#3.2">Configure Word-Breaking Procedure</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.3 <a href="#3.3">Configure Stopwords List</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.4 <a href="#3.4">Configure Stemming</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.5 <a href="#3.5">Configure Word Length</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.6 <a href="#3.6">Configure Removal of HTML Code</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.7 <a href="#3.7">Configure Accent Stripping</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.8 <a href="#3.8">Configure Fulltext Indexing</a><br/>
<strong>4. <a href="#4">Run BibIndex Daemon</a></strong><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.1 <a href="#4.1">Run BibIndex daemon</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.2 <a href="#4.2">Checking and repairing indexes</a><br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4.3 <a href="#4.3">Reindexing</a><br/>
<a name="1"></a><h2>1. Overview</h2>
<a name="2"></a><h2>2. Configure Metadata Tags and Fields</h2>
<a name="2.1"></a><h3>2.1 Configure Physical MARC Tags</h3>
<a name="2.2"></a><h3>2.2 Configure Logical Fields</h3>
<a name="3"></a><h2>3. Configure Word/Phrase Indexes</h2>
<a name="3.1"></a><h3>3.1 Define New Index</h3>
<p>To define a new index you must first give the index a internal
name. An empty index is then created by preparing the database tables.</p>
<p>Before the index can be used for searching, the fields that should be
included in the index must be selected.</p>
<p>When desired to fill the index based on the fields selected, you can
schedule the update by running <b>bibindex -w indexname</b> together
with other desired parameters.</p>
<a name="3.2"></a><h3>3.2 Configure Word-Breaking Procedure</h3>
<p>Can be configured by changing
<b>CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS</b> and
<b>CFG_BIBINDEX_CHARS_PUNCTUATION</b> in the general config file.</p>
<p>How the words are broken up defines what is added to the index. Should
only "director-general" be added, or should "director", "general" and
"director-general" be added? The index can vary between 300 000 and 3
000 000 terms based the policy for breaking words.</p>
<a name="3.3"></a><h3>3.3 Configure Stopwords List</h3>
<p>BibIndex supports stopword removal by not adding words which exists in
a given stopword list to the index. Stopword removal makes the index
smaller by removing much used words.</p>
<p>Which stopword list that should be used can be configured in the
general config file file by changing the value of the variable
CFG_BIBINDEX_PATH_TO_STOPWORDS_FILE. If no stopword list should be
used, the value should be 0.</p>
<a name="3.4"></a><h3>3.4 Configure stemming</h3>
<p>The BibIndex indexer supports stemming, removing the ending of
words thus creating a smaller indexer. For example, using English, the
word "information" will be stemmed to "inform"; or "looking", "looks",
and "looked" will be all stemmed to "look", thus giving more hits to each
word.</p>
<p>Currently you can configure the stemming language on a per-index basis. All
searches referring a stemmed index will also be stemmed based on the same
language.</p>
<a name="3.5"></a><h3>3.5 Configure Word Length</h3>
<p>By setting the value of <b>CFG_BIBINDEX_MIN_WORD_LENGTH</b> in the
general config file higher than 0, only words with the number of
characters higher than this will be added to the index.</p>
<a name="3.6"></a><h3>3.6 Configure Removal of HTML and LaTeX Code</h3>
<p>By setting the value of <b>CFG_BIBINDEX_REMOVE_HTML_MARKUP</b> in
the general config file, the indexer may try to remove all HTML code
from documents before indexing, and index only the text left. (HTML
code is defined as everything between '&lt;' and '>' in a text.)</p>
<p>By setting the value of <b>CFG_BIBINDEX_REMOVE_LATEX_MARKUP</b> in
the general config file, the indexer may try to remove all LaTeX code
from documents before indexing, and index only the text left. (LaTeX
code is defined as everything between '\command{' and '}' in a text, or
'{\command ' and '}').</p>
<a name="3.7"></a><h3>3.7 Configure Accent Stripping</h3>
<a name="3.8"></a><h3>3.8 Configure Fulltext Indexing</h3>
<p>The metadata tags are usually indexed by its content. There are
special cases however, such as the fulltext indexing. In this case
the tag contains an URL to the fulltext material and we would like to
fetch this material and index words found in this material rather than
in the metadata itself. This is possible via special tag assignement
via <code>tagToWordsFunctions</code> variable.</p>
<p>The default setup is configured in the way that if the indexer sees
that it has to index tag <code>8564_u</code>, it switches into the
fulltext indexing mode described above. It can index locally stored
files or even fetch them from external URLs, depending on the value of
the <b>CFG_BIBINDEX_FULLTEXT_INDEX_LOCAL_FILES_ONLY</b> configuration
variable. When fetching files from remote URLs, when it ends on a
splash page (an intermediate page before getting to fulltext file
itself), it can find and follow any further links to fulltext files.</p>
<p>The default setup also differentiate between metadata and fulltext
indexing, so that <code>any field</code> index does process only
metadata, not fulltext. If you want to have the fulltext indexed
together with the metadata, so that both are searched by default, you
can go to BibIndex Admin interface and in the Manage Logical Fields
explicitly add the tag <code>8564_u</code> under <code>any
field</code> field.</p>
<a name="4"></a><h2>4. Run BibIndex Daemon</h2>
<a name="4.1"></a><h3>4.1 Run BibIndex daemon</h3>
<p>To index your newly created or modified documents, bibindex must be
run periodically via bibsched. This is achieved by the sleep option
(-s) to bibindex. For more information please see <a
href="../howto/run.html">HOWTO Run</a> admin guide.</p>
<a name="4.2"></a><h3>4.2 Checking and repairing indexes</h3>
<p>Upon each indexing run, bibindex checks and reports any
inconsistencies in the indexes. You can also manually check for the
index corruption yourself by using the check (-k) option to bibindex.</p>
<p>If a problem is found during the check, bibindex hints you to run
repairing (-r). If you run it, then during repair bibindex tries to
correct problems automatically by its own means. Usually it succeeds.</p>
<p>When the automatic repairing does not succeed though, then manual
intervention is required. The easiest thing to get the indexes back
to shape are commands like: (assuming the problem is with the index ID
1):
<blockquote>
<pre>
$ echo "DELETE FROM idxWORD01R WHERE type='TEMPORARY' or type='FUTURE';" | \
/opt/cds-invenio/bin/dbexec
</pre>
</blockquote>
to leave only the 'CURRENT' reverse index. After that you can rerun
the index checking procedure (-k) and, if successful, continue with
the normal web site operation. However, a full reindexing should be
scheduled for the forthcoming night or weekend.</p>
<a name="4.3"></a><h3>4.3 Reindexing</h3>
<p>The procedure of reindexing is taking place into the real indexes that
are also used for searching. Therefore the end users will feel
immediately any change in the indexes. If you need to reindex your
records from scratch, then the best procedure is the following: reindex
the collection index only (fast operation), recreate collection cache,
and only after that reindex all the other indexes (slow operation).
This will ensure that the records in your system will be at least
browsable while the indexes are being rebuilt. The steps to perform are:</p>
<p>First we reindex the collection index:
<blockquote>
<pre>
$ bibindex --reindex -f50000 -wcollection # reindex the collection index (fast)
$ echo "UPDATE collection SET reclist=NULL;" | \
/opt/cds-invenio/bin/dbexec # clean collection cache
$ webcoll -f # recreate the collection cache
$ bibsched # run the two above-submitted tasks
$ sudo apachectl restart
</pre>
</blockquote></p>
<p>Then we launch (slower) reindexing of the remaining indexes:
<blockquote>
<pre>
$ bibindex --reindex -f50000 # reindex other indexes (slow)
$ webcoll -f
$ bibsched # run the two above-submitted tasks, and put the queue back in auto mode
$ sudo apachectl restart
</pre>
</blockquote></p>
<p>You may optionally want to reindex the word ranking tables:
<blockquote>
<pre>
$ bibsched # wait for all active tasks to finish, and put the queue into manual mode
$ cd cds-invenio-0.92.1 # source dir
$ grep rnkWORD ./modules/miscutil/sql/tabbibclean.sql | \
/opt/cds-invenio/bin/dbexec # truncate rank indexes
$ echo "UPDATE rnkMETHOD SET last_updated='0000-00-00 00:00:00';" | \
/opt/cds-invenio/bin/dbexec # rewind the last ranking time
</pre>
</blockquote></p>
<p>Secondly, if you have been using custom ranking methods using new
rnkWORD* tables (most probably you have not), you would have to
truncate them too:
<blockquote>
<pre>
&nbsp; # find out which custom ranking indexes were added:
&nbsp; $ echo "SELECT id FROM rnkMETHOD" | /opt/cds-invenio/bin/dbexec
&nbsp; id
&nbsp; 66
&nbsp; 67
&nbsp; [...]
&nbsp;
&nbsp; # for every ranking index id, truncate corresponding ranking tables:
&nbsp; $ echo "TRUNCATE rnkWORD66F" | /opt/cds-invenio/bin/dbexec
&nbsp; $ echo "TRUNCATE rnkWORD66R" | /opt/cds-invenio/bin/dbexec
&nbsp; $ echo "TRUNCATE rnkWORD67F" | /opt/cds-invenio/bin/dbexec
&nbsp; $ echo "TRUNCATE rnkWORD67R" | /opt/cds-invenio/bin/dbexec
</pre>
</blockquote></p>
<p>At last, we launch reindexing of the ranking indexes:
<blockquote>
<pre>
$ bibrank -f50000
$ bibsched # run the three above-submitted tasks, and put the queue back in auto mode
$ sudo apachectl restart
</pre>
</blockquote>
and we are done.</p>
<p>In the future CDS Invenio should ideally run indexing into
invisible tables that would be switched against the production ones
once the indexing process is successfully over. For the time being,
if reindexing takes several hours in your installation (e.g. if you
have 1,000,000 records), you may want to mysqlhotcopy your tables and
run reindexing on those copies yourself.</p>

Event Timeline