api.html.wml
Created
Sat, May 11, 22:21

## $Id$
## This file is part of the CERN Document Server Software (CDSware).
## Copyright (C) 2002, 2003, 2004, 2005, 2006 CERN.
##
## The CDSware is free software; you can redistribute it and/or
## modify it under the terms of the GNU General Public License as
## published by the Free Software Foundation; either version 2 of the
## License, or (at your option) any later version.
##
## The CDSware is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with CDSware; if not, write to the Free Software Foundation, Inc.,
## 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
#include "cdspage.wml" \
title="Search Engine API" \
navbar_name="hacking-websearch" \
navtrail_previous_links="<a class=navtrail href=<WEBURL>/hacking/>Hacking CDSware</a> &gt; <a class=navtrail href=index.html>WebSearch Internals</a> " \
navbar_select="hacking-websearch-api"
<p>Version <: print generate_pretty_revision_date_string('$Id$'); :>
<protect>
<pre>
The CDSware Search Engine can be called from within your Python
programs via a high-level, a mid-level, or a low-level API.
1. High-level API
Description:
The high-level access to the search engine is provided by
exactly the same function as called from the web interface when
users submit their queries. This should guarantee exactly the
same behaviour, and means that you can pass to the high-level
API all the arguments as you see them in the URL.
There are two things to note: (i) the function does not check
whether a collection is restricted, so restricted collections
will be searched without asking for a password; (ii) the output
format argument (``of'') should be set to ``id'' (which is the
default value), in which case the function returns the list of
recIDs found instead of creating a web page.
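To illustrate the correspondence between URL arguments and API
arguments, here is a minimal sketch that converts the query string
of a search URL into keyword arguments. It is not part of CDSware:
the helper name url_to_search_kwargs and the example URL are
hypothetical, and the sketch assumes Python 3 and its urllib.parse
module. It forces of="id" so that a list of recIDs is returned.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Arguments that the signature documents as lists rather than single values.
LIST_ARGS = {"c"}


def url_to_search_kwargs(url):
    """Turn the query string of a search URL into keyword arguments.

    Hypothetical helper, not part of CDSware.
    """
    query = parse_qs(urlsplit(url).query)
    kwargs = {}
    for name, values in query.items():
        # 'c' may repeat in the URL; other arguments keep their last value.
        kwargs[name] = values if name in LIST_ARGS else values[-1]
    # Ask for a list of recIDs instead of a web page.
    kwargs["of"] = "id"
    return kwargs


# Made-up search URL, built with urlencode for readability.
url = "http://example.org/search.py?" + urlencode(
    [("p", "muon or kaon"), ("f", "title"),
     ("c", "Theses"), ("c", "Books"), ("of", "hb")]
)
kwargs = url_to_search_kwargs(url)
print(kwargs)
```

The resulting dictionary could then be handed over as
perform_request_search(**kwargs).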
Signature:
def perform_request_search(req=None, cc=cdsname, c=None, p="", f="", rg="10", sf="", so="d", sp="", rm="", of="id", ot="", as="0",
p1="", f1="", m1="", op1="", p2="", f2="", m2="", op2="", p3="", f3="", m3="", sc="0", jrec="0",
recid="-1", recidb="-1", sysno="", id="-1", idb="-1", sysnb="", action="",
d1y="", d1m="", d1d="", d2y="", d2m="", d2d="", verbose="0", ap="0", ln=cdslang):
"""Perform search or browse request, without checking for
authentication. Return list of recIDs found, if of=id.
Otherwise create web page.
The arguments are as follows:
req - mod_python Request class instance.
cc - current collection (e.g. "ATLAS"). The collection the
user started to search/browse from.
c - collection list (e.g. ["Theses", "Books"]). The
collections the user may have selected/deselected when
starting to search from 'cc'.
p - pattern to search for (e.g. "ellis and muon or kaon").
f - field to search within (e.g. "author").
rg - records in groups of (e.g. "10"). Defines how many hits
per collection in the search results page are
displayed.
sf - sort field (e.g. "title").
so - sort order ("a"=ascending, "d"=descending).
sp - sort pattern (e.g. "CERN-") -- in case a sort field holds
several values, this argument tells which one to
prefer.
rm - ranking method (e.g. "jif"). Defines whether results
should be ranked by some known ranking method.
of - output format (e.g. "hb"). Usually a format starting with
"h" means HTML output ("hb" for HTML brief, "hd" for HTML
detailed), "x" means XML output, "t" means plain text
output, and "id" means no output at all: only the list
of recIDs found is returned. (Suitable for high-level API.)
ot - output only these MARC tags (e.g. "100,700,909C0b").
Useful if only some fields are to be shown in the
output, e.g. for library to control some fields.
as - advanced search ("0" means no, "1" means yes). Whether
search was called from within the advanced search
interface.
p1 - first pattern to search for in the advanced search
interface. Much like 'p'.
f1 - first field to search within in the advanced search
interface. Much like 'f'.
m1 - first matching type in the advanced search interface.
("a" all of the words, "o" any of the words, "e" exact
phrase, "p" partial phrase, "r" regular expression).
op1 - first operator, to join the first and the second unit
in the advanced search interface. ("a" and, "o" or,
"n" not).
p2 - second pattern to search for in the advanced search
interface. Much like 'p'.
f2 - second field to search within in the advanced search
interface. Much like 'f'.
m2 - second matching type in the advanced search interface.
("a" all of the words, "o" any of the words, "e" exact
phrase, "p" partial phrase, "r" regular expression).
op2 - second operator, to join the second and the third unit
in the advanced search interface. ("a" and, "o" or,
"n" not).
p3 - third pattern to search for in the advanced search
interface. Much like 'p'.
f3 - third field to search within in the advanced search
interface. Much like 'f'.
m3 - third matching type in the advanced search interface.
("a" all of the words, "o" any of the words, "e" exact
phrase, "p" partial phrase, "r" regular expression).
sc - split by collection ("0" no, "1" yes). Governs whether
we want to present the results in a single huge list,
or split by collection.
jrec - jump to record (e.g. "234"). Used for navigation
inside the search results.
recid - display record ID (e.g. "20000"). Do not
search/browse but go straight away to the Detailed
record page for the given recID.
recidb - display record ID bis (e.g. "20010"). If greater than
'recid', then display records from recid to recidb.
Useful for example for dumping records from the
database for reformatting.
sysno - display old system SYS number (e.g. ""). If you
migrate to CDSware from another system, and store your
old SYS call numbers, you can use them instead of recid
if you wish so.
id - the same as recid, in case recid is not set. For
backwards compatibility.
idb - the same as recidb, in case recidb is not set. For
backwards compatibility.
sysnb - the same as sysno, in case sysno is not set. For
backwards compatibility.
action - action to do. "SEARCH" for searching, "Browse" for
browsing. Default is to search.
d1y - first date year (e.g. "1998"). Useful for search
limits on creation date.
d1m - first date month (e.g. "08"). Useful for search
limits on creation date.
d1d - first date day (e.g. "23"). Useful for search
limits on creation date.
d2y - second date year (e.g. "1998"). Useful for search
limits on creation date.
d2m - second date month (e.g. "09"). Useful for search
limits on creation date.
d2d - second date day (e.g. "02"). Useful for search limits
on creation date.
verbose - verbose level (0=min, 9=max). Useful to print some
internal information on the searching process in case
something goes wrong.
ap - alternative patterns (0=no, 1=yes). In case no exact
match is found, the search engine can try alternative
patterns e.g. to replace non-alphanumeric characters by
a boolean query. ap defines if this is wanted.
ln - language of the search interface (e.g. "en"). Useful
for internationalization.
"""
Examples:
>>> # import the function:
>>> from cdsware.search_engine import perform_request_search
>>> # get all hits in a collection:
>>> perform_request_search(cc="ATLAS Communications")
>>> # search for the word `of' in Theses and Books:
>>> perform_request_search(p="of", c=["Theses","Books"])
>>> # search for `muon or kaon' within title:
>>> perform_request_search(p="muon or kaon", f="title")
>>> # phrase search (note the quotes):
>>> perform_request_search(p='"Ellis, J"', f="author")
>>> # regexp search for a report number:
>>> perform_request_search(p1="^CERN.*2003-001$", f1="reportnumber", m1="r")
>>> # moi inside Standards gives no hits...
>>> perform_request_search(p="moi", cc="Standards")
>>> # but it does if we use alternative patterns:
>>> perform_request_search(p="moi", cc="Standards", ap=1)
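The date limit arguments (d1y, d1m, d1d, d2y, d2m, d2d) are
documented above as zero-padded strings. The following minimal
sketch builds them from two datetime.date objects. The helper name
date_limit_kwargs is hypothetical and not part of CDSware.

```python
from datetime import date


def date_limit_kwargs(first, second):
    """Build the d1y..d2d arguments from two datetime.date objects.

    Hypothetical helper, not part of CDSware.
    """
    return {
        "d1y": "%04d" % first.year,
        "d1m": "%02d" % first.month,
        "d1d": "%02d" % first.day,
        "d2y": "%04d" % second.year,
        "d2m": "%02d" % second.month,
        "d2d": "%02d" % second.day,
    }


limits = date_limit_kwargs(date(1998, 8, 23), date(1998, 9, 2))
print(limits)
```

The dictionary could be merged into a call such as
perform_request_search(p="muon", **limits).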
2. Mid-level API
Description:
The mid-level API is provided by the search_pattern() function,
which only searches for the given pattern in the given field
according to the given matching type. This function does not
know anything about collections. It does not wash its
arguments; it expects them to be `clean' already. The pattern
is split into `basic search units' for which a boolean query is
launched. The function returns an instance of the HitSet class.
Note that if you want to obtain the list of recIDs (as with the
high-level API), you can invoke the ``tolist()'' method on a
hitset.
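The idea of splitting a pattern into basic search units and
combining their hits can be pictured with the toy sketch below.
It is an illustration only and not the CDSware algorithm: the
in-memory index, the helper names, and the strict left-to-right
evaluation of the operators are all assumptions.

```python
# Toy inverted index: word -> set of recIDs (made-up data).
TOY_INDEX = {
    "ellis": {1, 2, 5},
    "muon": {2, 3},
    "kaon": {5, 8},
}


def split_into_units(pattern):
    """Split a pattern into (operator, word) pairs, left to right."""
    units, operator = [], "and"
    for token in pattern.lower().split():
        if token in ("and", "or", "not"):
            operator = token
        else:
            units.append((operator, token))
            operator = "and"  # words without an explicit operator are ANDed
    return units


def toy_search_pattern(pattern):
    """Evaluate the units strictly left to right on TOY_INDEX."""
    hits = None
    for operator, word in split_into_units(pattern):
        found = TOY_INDEX.get(word, set())
        if hits is None:
            hits = set(found)  # the first unit starts the result
        elif operator == "and":
            hits = hits.intersection(found)
        elif operator == "or":
            hits = hits.union(found)
        else:  # "not"
            hits = hits.difference(found)
    return sorted(hits or [])


print(split_into_units("ellis and muon or kaon"))
print(toy_search_pattern("ellis and muon or kaon"))
```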
Signature:
def search_pattern(req=None, p=None, f=None, m=None, ap=0, of="id", verbose=0):
"""Search for complex pattern 'p' within field 'f' according to
matching type 'm'. Return hitset of recIDs.
The function uses a multi-stage searching algorithm in case no
exact match is found. See the Search Internals document for a
detailed description.
The 'ap' argument governs whether alternative patterns are to
be used in case there is no direct hit for (p,f,m). For
example, whether to replace non-alphanumeric characters by
spaces if it would give some hits. See the Search Internals
document for a detailed description. (ap=0 forbids the
alternative pattern usage, ap=1 permits it.)
The 'of' argument governs whether to print some information to
the user in case no match is found. (Usually it prints the
information in case of HTML formats, otherwise it's
silent).
The 'verbose' argument controls the level of debugging information
to be printed (0=least, 9=most).
All the parameters are assumed to have been previously washed.
This function is suitable as a mid-level API.
"""
Examples:
>>> # import the function:
>>> from cdsware.search_engine import search_pattern
>>> # search for muon or kaon in any field:
>>> search_pattern(p="muon or kaon").tolist()
>>> # the following finds nothing by default...
>>> search_pattern(p="cern-moi").tolist()
>>> # ...but it does find something if we allow alternative patterns:
>>> search_pattern(p="cern-moi", ap=1).tolist()
>>> # wildcard search for a report number:
>>> search_pattern(p="CERN-LHC-PROJECT-REPORT-40*", f="reportnumber").tolist()
>>> # regexp search for a report number with possible trailing subjects:
>>> search_pattern(p="^CERN-LHC-PROJECT-REPORT-40(-|$)", f="reportnumber", m="r").tolist()
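The effect of the 'ap' argument seen in the cern-moi examples
above can be pictured with the following toy sketch. It is not
the CDSware implementation: the lookup function, its made-up
records, and the helper names are assumptions, and only the
"replace non-alphanumeric characters by spaces" alternative is
modelled.

```python
import re


def search_with_alternatives(p, lookup, ap=0):
    """Try 'p' as given; if ap is set and nothing matches, retry relaxed."""
    hits = lookup(p)
    if hits or not ap:
        return hits
    # Alternative pattern: non-alphanumeric characters become spaces.
    relaxed = re.sub(r"[^0-9A-Za-z]+", " ", p).strip()
    if relaxed == p:
        return hits  # nothing to relax, keep the empty result
    return lookup(relaxed)


def toy_lookup(pattern):
    """Stand-in search: recIDs of toy records containing every word."""
    records = {10: "cern moi report", 11: "muon physics"}
    words = pattern.lower().split()
    return [recid for recid, text in sorted(records.items())
            if all(word in text.split() for word in words)]


print(search_with_alternatives("cern-moi", toy_lookup))
print(search_with_alternatives("cern-moi", toy_lookup, ap=1))
```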
3. Low-level API
Description:
The low-level API is provided by the search_unit() function,
which assumes that its arguments are already basic search units.
Therefore it does not know anything about boolean queries, etc.
The function returns an instance of the HitSet class. Note that
if you want to obtain the list of recIDs (as with the high-level
API), you can invoke the ``tolist()'' method on a hitset.
Signature:
def search_unit(p, f=None, m=None):
"""Search for basic search unit defined by pattern 'p' and field
'f' and matching type 'm'. Return hitset of recIDs.
All the parameters are assumed to have been previously washed.
'p' is assumed to be already a ``basic search unit'' so that it
is searched as such and is not broken up in any way. Only
wildcard and span queries are detected inside 'p'.
This function is suitable as a low-level API.
"""
Examples:
>>> # import the function:
>>> from cdsware.search_engine import search_unit
>>> # search moi in any field:
>>> search_unit(p="moi").tolist()
>>> # this one will not match, as the pattern is searched as a single unit:
>>> search_unit(p="muon or kaon").tolist()
>>> # regexp search for a report number with possible trailing subjects:
>>> search_unit(p="^CERN-PS-99-037(-|$)", f="reportnumber", m="r").tolist()
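Since search_unit() does not know anything about boolean
queries, the caller has to combine the hits of several units.
The minimal sketch below does this for an OR query, using only
the tolist() method mentioned above. ToyHitSet and
toy_search_unit are hypothetical stand-ins so that the sketch
runs without CDSware; with a real installation one would pass
cdsware.search_engine.search_unit instead.

```python
def union_of_units(units, search_unit):
    """OR together the hits of several basic search units."""
    recids = set()
    for p in units:
        recids.update(search_unit(p=p).tolist())
    return sorted(recids)


class ToyHitSet:
    """Stand-in for HitSet offering only tolist()."""

    def __init__(self, recids):
        self._recids = list(recids)

    def tolist(self):
        return list(self._recids)


def toy_search_unit(p, f=None, m=None):
    """Stand-in for search_unit() working on made-up data."""
    toy_index = {"muon": [2, 3], "kaon": [5, 8]}
    return ToyHitSet(toy_index.get(p, []))


# A boolean pattern passed as a single unit matches nothing...
print(toy_search_unit(p="muon or kaon").tolist())
# ...whereas combining the two units by hand gives the union.
print(union_of_units(["muon", "kaon"], toy_search_unit))
```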
More entry points may be created, but I think these three kinds of
access to the search engine should cover all your needs.
</pre>
</protect>
