Page MenuHomec4science

word_similarity.html.wml
No OneTemporary

File Metadata

Created
Tue, Nov 12, 21:16

word_similarity.html.wml

## $Id$
## This file is part of the CERN Document Server Software (CDSware).
## Copyright (C) 2002, 2003, 2004, 2005 CERN.
##
## The CDSware is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## The CDSware is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with CDSware; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
#include "cdspage.wml" \
title="Word Similarity/Similar Records Methods" \
navbar_name="hacking-bibrank" \
navtrail_previous_links="<a class=navtrail href=<WEBURL>/hacking/>Hacking CDSware</a> &gt; <a class=navtrail href=<WEBURL>/hacking/bibrank/>BibRank Internals</a> " \
navbar_select="hacking-bibrank-index"
<pre>
<blockquote>
<b>Important regarding 'Word Similarity/Similar Records' method:</b>
Because of tight connection with the search_engine and database, there
can only be one method using the "word_similarity" template, and the
code for this method as designated in the BibRank admin interface has
to be 'wrd'.
<b>Preparing word similarity rank indexes:</b>
The 'Word similarity' method works by generating an index over terms in
the tags specified in the configuration file for the given records. The
data is stored in two tables, on forward and one reverse. The forward
index has a list of terms, where each term as a dictionary containing
records using this term. The reverse index contains the opposite, a list
of records, where each record contains a dictionary of the terms it
contains (for the selected tags). This means that the forward/reverse
index is to some degree similar to the tables created by BibIndex.
The main difference is that the rank method stores more information in
the table, based on how important the terms are, based on how many
times they have been used, and how important one term is in one
record. To minimize the number of terms to process, a few techniques
are used. Among these are stemming and stopword removal. Stemming
removes the end of a term, so that only the stem is left, this means
that 'looking' becomes 'look' and minimizes the size of the
database. Stopword removal removes very common words without meaning,
like 'the', 'one', 'me' in english. Terms that consists of numbers or
are below a certain limit is also ignored.
Since automatic language recognisition is not supported, each tag must
therefore be given a language for stemming to work. This means that in
the perfect world one tag should contain text in one language or
mostly in one language. If stemming is not wanted, the module can be
turned off, though lower rank quality may be expected.
Stopword removal works by checking if a term exists in a file, which
can contain any language necessary. Together with the default cdsware
installation, the file contains stopwords in french and english.
For a large CDSware installation (700 000 records), indexing takes
around 2 hours, including calculating the data needed for the
weighting scheme.
How the term importance is computed:
The method used is a variation of the well-known weighting scheme, the
vector model [1], in document retrieval. The method is described in
[2] and called 'Log-entropy' weigthing scheme. For more detailed
explanation of the scheme, the paper should be consulted. Since the
calculations necessary to calculate the number needed by the method is
too demanding, most of the numbers are calculated after the index over
term is created and stored in the database for later use.
<b>Step by step at index time:</b>(Using rebalance)
1. Load configuration for method
2. Begin a loop which loops through all records that should be added
3. Load content of tags in current record range
4. For each tag in all records, check each term if it should be used,
check against stopword list, use stemming, if accepted, add to a structure
the points from configfile for current tag.
5. Add to database the new values
6. Go back to 3 until no more records to be added.
7. Go through list of added terms, get list of all records containing these terms
8. Find all terms in records from last point.
9. For each record, calculate Fi, for each term calculate Gi
10.For each record, calculate normalization value Nj, add Gi value for each
term to the structure in reverse index.
11.Adding the Gi value to each term in forward index, and adding the
normalization value Nj to each term in each document
<b>Word Similarity at search/rank time:</b>
1. For each term, check if it can be used (like check agains stopword list),
use stemming on the term if possible.
2. For each term, get dictionary from forward index, calculate rank values
for each term.
3. Add any records not ranked to end of list.
4. Sort records.
5. Return sorted records
<b>Similar records at search/rank time:</b>
1. Get terms from reverse index which exists in record.
2. Sort terms and use only the most important ones for finding similar records.
3. For selected terms, rank the associated records
4. Sort records
5. Return sorted records
<b>References:</b>
[1] Modern Information Retrieval. Baeza-Yates/Ribeiro-Neto
[2] New term weighting formulas for the vector space method in information retrieval.
ORNL/TM-13756.E.Chisholm/T.G.Kolda
</pre>

Event Timeline