stages.html.wml
No OneTemporary
Actions

Subscribers

None

File Metadata

Created: Thu, Aug 8, 13:34

stages.html.wml
View Options

	## $Id$

	## This file is part of the CERN Document Server Software (CDSware).
	## Copyright (C) 2002 CERN.
	##
	## The CDSware is free software; you can redistribute it and/or
	## modify it under the terms of the GNU General Public License as
	## published by the Free Software Foundation; either version 2 of the
	## License, or (at your option) any later version.
	##
	## The CDSware is distributed in the hope that it will be useful, but
	## WITHOUT ANY WARRANTY; without even the implied warranty of
	## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
	## General Public License for more details.
	##
	## You should have received a copy of the GNU General Public License
	## along with CDSware; if not, write to the Free Software Foundation, Inc.,
	## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.

	#include "cdspage.wml" \
	title="Search Processing Stages" \
	navbar_name="hacking-websearch" \
	navtrail_previous_links="<a class=navtrail href=<WEBURL>/hacking/>Hacking CDSware</a> > <a class=navtrail href=index.html>WebSearch Internals</a> " \
	navbar_select="hacking-websearch-stages"

	<pre>
	SEARCH PROCESSING STAGES / $Date$

	What the search engine does after it receives a user query? Let us
	explain the steps it takes for a fictive complex query example in
	which a user typed

	author:ellis reportnumber:TH-2003-114 -muon +energ* title:"kaon decay"

	and selected "Theses" and "Books" collections and the "arrival date"
	between the 10th and the 20th of March 2003.

	The search engine proceeds as follows.

	1. Firstly we classify and break apart our search arguments into a)
	basic search units and b+c) additional search options:

	a) (p_1, f_1), (p_2, f_2), ..., (p_m, f_m) --- the list of basic
	(pattern, field) searching units. The user-typed boolean
	query is split into the (p_i, f_i) units according to the
	non-significant whitespace. (A whitespace is considered to
	be significant when it occurs within quoted expressions for
	phrase searching, see Search Tips and user-level
	documentation.) In our example, the list of basic search
	units is: (author, ellis), (reportnumber, TH-2003-114),
	(anyfield, muon), (anyfield, energ*), (title, "kaon decay").

	b) c_1, c_2, ..., c_n --- the list of collections the user wanted
	to search in. In our example: Theses, Books.

	c) l_1, l_2, ..., l_o --- the list of additional limiting search
	criteria, such as limit to certain arrival date, certain
	language, or certain subject category. The user usually
	selects such limits from within Advanced Search interface
	and its selection boxes. In our example, the limit is on
	the arrival date: (arrivaldate, 20030310->20030320).

	The basic search units and additional search arguments are then
	dealt with subsequently (from a to c with decreasing priority) in
	the following searching stages.

	2. For each (p_i, f_i), we verify that at least some hits can be found
	regardless of c_j and l_k. In other words, we make sure that p_i
	is a known indexed term in f_i. Note that p_i may contain
	asterisks and may start and end by a single/double quotes, with the
	following special meaning:

	foo*bar -- asterisk is a wildcard character, meaning to match
	any sequence of characters

	"foo and bar" -- double quotes to denote exact phrase matching

	'foo and bar' -- single quotes to denote partial phrase matching

	The p_i (word or phrase) is then tested for existence in the f_i
	(word or phrase) index.

	2-1. If p_i was found in the f_i index, we retain the (p_i, f_i)
	search unit.

	2-2. If p_i wasn't found inside f_i, we then look whether p_i
	contains some non-alphanumeric characters, such as a dash or a
	slash. If this is so, we try to replace them with a boolean
	AND query. In our example, the search unit (reportnumber,
	TH-2003-114) would be replaced by a new boolean query for
	(reportnumber,TH) and (reportnumber,2003) and
	(reportnumber,114). If this new query succeeds, we retain
	this new boolean query in place of the old search unit.

	2-3. If the preceding step failed, we propose to the end user a
	list of nearest indexed terms (words, phrases) around p_i
	within f_i index and let the user choose one known indexed
	term out of this list. In our example, the phrase "kaon
	decay" cannot be found in the title index, so we'll propose
	closest titles around "kaon decay ...". Let's suppose that
	the user will choose "Kaon decays and the flavour problem" out
	of this list.

	After all the basic search units (p_i, f_i) have been treated we
	may proceed to the following stage 3.

	3. At this stage, all search units (p_i, f_i) are known to yield at
	least some results. We now continue by trying boolean query as
	specified by the user. The execution priority goes from left to
	right, the known boolean operators are:

	+ for set intersection
	- for set difference
	\| for set union

	In our example, we proceed by doing the set intersections of
	(author, ellis), (reportnumber,TH), (reportnumber,2003),
	(reportnumber,114), followed by set differentiation with (anyfield,
	muon), followed by set intersection with (anyfield, energ*) and
	(title, "Kaon decays and the flavour problem").

	3-1. If this gives some hits, we proceed to stage 4.

	3-2. If this does not give any hit, we display the number of hits
	found for each search unit, advise the user to combine his
	search terms differently, and we stop.

	4. At this stage, the boolean query (p_1, f_1), (p_2, f_2), ... ,
	(p_n, f_n) is known to yield some results. We now continue by
	checking whether these results fall into collections c_j that the
	user has chosen. This is done by performing a set intersection of
	the results obtained so far with the collection universe for c_j.

	4-1. If this gives some hits, we proceed to stage 5.

	4-2. If this does not give any hit, we first try to look for the
	query in any public collection.

	4-2-1. If this gives some hits, we warn the user that no match
	could have been found in his c_j choice but that there
	are hits in other public collections, and retaining
	these hits we proceed to stage 5.

	4-2-2. If this does not give any hit, then there must have
	been some hits in some of the restricted collections.
	We display a warning that the restricted collections
	must be explicitly selected before searching and we
	stop.

	5. At this stage, the boolean query (p_i, f_i) within c_j is known to
	yield some results. We now proceed by checking additional search
	limits l_k imposed by the user. This is done by subsequent set
	intersections of the results obtained so far with the universe of
	records matching limiting criteria (l_1, l_2, ..., l_o).

	5-1. If this gives some hits, we proceed to stage 6.

	5-2. If this does not give any hit for a certain l_k, we warn the
	user that no match could have been found for his l_o choices
	and proceed to stage 6 with the results obtained so far.

	6. We are done and may display the results.

	</pre>

stages.html.wmlNo OneTemporaryActions

File Metadata

stages.html.wmlView Options

Event Timeline

stages.html.wml
No OneTemporary
Actions

stages.html.wml
View Options