word_similarity.html.wml
No OneTemporary
Actions

Subscribers

None

File Metadata

Created: Tue, Nov 12, 21:16

word_similarity.html.wml
View Options

	## $Id$

	## This file is part of the CERN Document Server Software (CDSware).
	## Copyright (C) 2002, 2003, 2004, 2005 CERN.
	##
	## The CDSware is free software; you can redistribute it and/or
	## modify it under the terms of the GNU General Public License as
	## published by the Free Software Foundation; either version 2 of the
	## License, or (at your option) any later version.
	##
	## The CDSware is distributed in the hope that it will be useful, but
	## WITHOUT ANY WARRANTY; without even the implied warranty of
	## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
	## General Public License for more details.
	##
	## You should have received a copy of the GNU General Public License
	## along with CDSware; if not, write to the Free Software Foundation, Inc.,
	## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.

	#include "cdspage.wml" \
	title="Word Similarity/Similar Records Methods" \
	navbar_name="hacking-bibrank" \
	navtrail_previous_links="<a class=navtrail href=<WEBURL>/hacking/>Hacking CDSware</a> > <a class=navtrail href=<WEBURL>/hacking/bibrank/>BibRank Internals</a> " \
	navbar_select="hacking-bibrank-index"

	<pre>
	<blockquote>
	<b>Important regarding 'Word Similarity/Similar Records' method:</b>
	Because of tight connection with the search_engine and database, there
	can only be one method using the "word_similarity" template, and the
	code for this method as designated in the BibRank admin interface has
	to be 'wrd'.

	<b>Preparing word similarity rank indexes:</b>
	The 'Word similarity' method works by generating an index over terms in
	the tags specified in the configuration file for the given records. The
	data is stored in two tables, on forward and one reverse. The forward
	index has a list of terms, where each term as a dictionary containing
	records using this term. The reverse index contains the opposite, a list
	of records, where each record contains a dictionary of the terms it
	contains (for the selected tags). This means that the forward/reverse
	index is to some degree similar to the tables created by BibIndex.

	The main difference is that the rank method stores more information in
	the table, based on how important the terms are, based on how many
	times they have been used, and how important one term is in one
	record. To minimize the number of terms to process, a few techniques
	are used. Among these are stemming and stopword removal. Stemming
	removes the end of a term, so that only the stem is left, this means
	that 'looking' becomes 'look' and minimizes the size of the
	database. Stopword removal removes very common words without meaning,
	like 'the', 'one', 'me' in english. Terms that consists of numbers or
	are below a certain limit is also ignored.

	Since automatic language recognisition is not supported, each tag must
	therefore be given a language for stemming to work. This means that in
	the perfect world one tag should contain text in one language or
	mostly in one language. If stemming is not wanted, the module can be
	turned off, though lower rank quality may be expected.

	Stopword removal works by checking if a term exists in a file, which
	can contain any language necessary. Together with the default cdsware
	installation, the file contains stopwords in french and english.

	For a large CDSware installation (700 000 records), indexing takes
	around 2 hours, including calculating the data needed for the
	weighting scheme.

	How the term importance is computed:
	The method used is a variation of the well-known weighting scheme, the
	vector model [1], in document retrieval. The method is described in
	[2] and called 'Log-entropy' weigthing scheme. For more detailed
	explanation of the scheme, the paper should be consulted. Since the
	calculations necessary to calculate the number needed by the method is
	too demanding, most of the numbers are calculated after the index over
	term is created and stored in the database for later use.

	<b>Step by step at index time:</b>(Using rebalance)
	1. Load configuration for method
	2. Begin a loop which loops through all records that should be added
	3. Load content of tags in current record range
	4. For each tag in all records, check each term if it should be used,
	check against stopword list, use stemming, if accepted, add to a structure
	the points from configfile for current tag.
	5. Add to database the new values
	6. Go back to 3 until no more records to be added.
	7. Go through list of added terms, get list of all records containing these terms
	8. Find all terms in records from last point.
	9. For each record, calculate Fi, for each term calculate Gi
	10.For each record, calculate normalization value Nj, add Gi value for each
	term to the structure in reverse index.
	11.Adding the Gi value to each term in forward index, and adding the
	normalization value Nj to each term in each document

	<b>Word Similarity at search/rank time:</b>
	1. For each term, check if it can be used (like check agains stopword list),
	use stemming on the term if possible.
	2. For each term, get dictionary from forward index, calculate rank values
	for each term.
	3. Add any records not ranked to end of list.
	4. Sort records.
	5. Return sorted records

	<b>Similar records at search/rank time:</b>
	1. Get terms from reverse index which exists in record.
	2. Sort terms and use only the most important ones for finding similar records.
	3. For selected terms, rank the associated records
	4. Sort records
	5. Return sorted records

	<b>References:</b>
	[1] Modern Information Retrieval. Baeza-Yates/Ribeiro-Neto
	[2] New term weighting formulas for the vector space method in information retrieval.
	ORNL/TM-13756.E.Chisholm/T.G.Kolda
	</pre>

word_similarity.html.wmlNo OneTemporaryActions

File Metadata

word_similarity.html.wmlView Options

Event Timeline

word_similarity.html.wml
No OneTemporary
Actions

word_similarity.html.wml
View Options