## -*- mode: html; coding: utf-8; -*- ## This file is part of Invenio. ## Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 CERN. ## ## Invenio is free software; you can redistribute it and/or ## modify it under the terms of the GNU General Public License as ## published by the Free Software Foundation; either version 2 of the ## License, or (at your option) any later version. ## ## Invenio is distributed in the hope that it will be useful, but ## WITHOUT ANY WARRANTY; without even the implied warranty of ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU ## General Public License for more details. ## ## You should have received a copy of the GNU General Public License ## along with Invenio; if not, write to the Free Software Foundation, Inc., ## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
What the search engine does after it receives a user query? Let us explain the steps it takes for a fictive complex query example in which a user typed author:ellis reportnumber:TH-2003-114 -muon +energ* title:"kaon decay" and selected "Theses" and "Books" collections and the "arrival date" between the 10th and the 20th of March 2003. The search engine proceeds as follows. 1. Firstly we classify and break apart our search arguments into a) basic search units and b+c) additional search options: a) (p_1, f_1), (p_2, f_2), ..., (p_m, f_m) --- the list of basic (pattern, field) searching units. The user-typed boolean query is split into the (p_i, f_i) units according to the non-significant whitespace. (A whitespace is considered to be significant when it occurs within quoted expressions for phrase searching, see Search Tips and user-level documentation.) In our example, the list of basic search units is: (author, ellis), (reportnumber, TH-2003-114), (anyfield, muon), (anyfield, energ*), (title, "kaon decay"). b) c_1, c_2, ..., c_n --- the list of collections the user wanted to search in. In our example: Theses, Books. c) l_1, l_2, ..., l_o --- the list of additional limiting search criteria, such as limit to certain arrival date, certain language, or certain subject category. The user usually selects such limits from within Advanced Search interface and its selection boxes. In our example, the limit is on the arrival date: (arrivaldate, 20030310->20030320). The basic search units and additional search arguments are then dealt with subsequently (from a to c with decreasing priority) in the following searching stages. 2. For each (p_i, f_i), we verify that at least some hits can be found regardless of c_j and l_k. In other words, we make sure that p_i is a known indexed term in f_i. Note that p_i may contain asterisks and may start and end by a single/double quotes, with the following special meaning: foo*bar -- asterisk is a wildcard character, meaning to match any sequence of characters "foo and bar" -- double quotes to denote exact phrase matching 'foo and bar' -- single quotes to denote partial phrase matching The p_i (word or phrase) is then tested for existence in the f_i (word or phrase) index. 2-1. If p_i was found in the f_i index, we retain the (p_i, f_i) search unit. We also retain it in the case when p_i wasn't found in the f_i index but when (p_i, f_i) is joined to the previous unit or to the next unit by boolean operator OR. 2-2. If p_i wasn't found inside f_i, we then look whether p_i contains some non-alphanumeric characters, such as a dash or a slash. If this is so, we try to replace them with a boolean AND query. In our example, the search unit (reportnumber, TH-2003-114) would be replaced by a new boolean query for (reportnumber,TH) and (reportnumber,2003) and (reportnumber,114). If this new query succeeds, we retain this new boolean query in place of the old search unit. 2-3. If the preceding step failed, we propose to the end user a list of nearest indexed terms (words, phrases) around p_i within f_i index and let the user choose one known indexed term out of this list. In our example, the phrase "kaon decay" cannot be found in the title index, so we'll propose closest titles around "kaon decay ...". Let's suppose that the user will choose "Kaon decays and the flavour problem" out of this list. After all the basic search units (p_i, f_i) have been treated we may proceed to the following stage 3. 3. At this stage, all search units (p_i, f_i) are known to yield at least some results. We now continue by trying boolean query as specified by the user. The execution priority goes from left to right, the known boolean operators are: + for set intersection - for set difference | for set union In our example, we proceed by doing the set intersections of (author, ellis), (reportnumber,TH), (reportnumber,2003), (reportnumber,114), followed by set differentiation with (anyfield, muon), followed by set intersection with (anyfield, energ*) and (title, "Kaon decays and the flavour problem"). 3-1. If this gives some hits, we proceed to stage 4. 3-2. If this does not give any hit, we display the number of hits found for each search unit, advise the user to combine his search terms differently, and we stop. 4. At this stage, the boolean query (p_1, f_1), (p_2, f_2), ... , (p_n, f_n) is known to yield some results. We now continue by checking whether these results fall into collections c_j that the user has chosen. This is done by performing a set intersection of the results obtained so far with the collection universe for c_j. 4-1. If this gives some hits, we proceed to stage 5. 4-2. If this does not give any hit, we first try to look for the query in any public collection. 4-2-1. If this gives some hits, we warn the user that no match could have been found in his c_j choice but that there are hits in other public collections. We propose a link to get them and we stop. 4-2-2. If this does not give any hit, then there must have been some hits in some of the restricted collections. We display a warning that the restricted collections must be explicitly selected before searching and we stop. 5. At this stage, the boolean query (p_i, f_i) within c_j is known to yield some results. We now proceed by checking additional search limits l_k imposed by the user. This is done by subsequent set intersections of the results obtained so far with the universe of records matching limiting criteria (l_1, l_2, ..., l_o). 5-1. If this gives some hits, we proceed to stage 6. 5-2. If this does not give any hit for a certain l_k, we warn the user that no match could have been found for his l_o choices and proceed to stage 6 with the results obtained so far. 6. We are done and may display the results.