bibmatch-admin-guide.webdoc
Created
Fri, Aug 9, 15:26

# -*- mode: html; coding: utf-8; -*-
# This file is part of Invenio.
# Copyright (C) 2007, 2008, 2010, 2011 CERN.
#
# Invenio is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License as
# published by the Free Software Foundation; either version 2 of the
# License, or (at your option) any later version.
#
# Invenio is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Invenio; if not, write to the Free Software Foundation, Inc.,
# 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.
<!-- WebDoc-Page-Title: BibMatch Admin Guide -->
<!-- WebDoc-Page-Navtrail: <a class="navtrail" href="<CFG_SITE_URL>/help/admin<lang:link/>">_(Admin Area)_</a> -->
<!-- WebDoc-Page-Revision: $Id$ -->
<ul style="list-style-type:None">
<li><strong>1. <a href="#shortIntro">Overview</a></strong>
<ul style="list-style-type:None">
<li>1.1&nbsp;&nbsp;<a href="#overview">What is BibMatch?</a></li>
<li>1.2&nbsp;&nbsp;<a href="#features">Features</a></li>
</ul>
</li>
<li><strong>2. <a href="#usage">Usage</a></strong>
<ul style="list-style-type:None">
<li>2.1&nbsp;&nbsp;<a href="#basic">Basic usage</a></li>
<li>2.2&nbsp;&nbsp;<a href="#advanced">Advanced usage</a></li>
</ul>
</li>
<li><strong>3. <a href="#examples">More examples</a></strong>
<ul style="list-style-type:None">
<li>3.1&nbsp;&nbsp;<a href="#more">Some more examples</a></li>
</ul>
</li>
<li><strong><a href="#usage">Appendix</a></strong>
<ul style="list-style-type:None">
<li>A&nbsp;&nbsp;<a href="#querystrings">Querystrings</a></li>
<li>B&nbsp;&nbsp;<a href="#bibconvert">BibConvert formats</a></li>
<li>C&nbsp;&nbsp;<a href="#predefined">Predefined search fields</a></li>
<li>D&nbsp;&nbsp;<a href="#cli">BibMatch commmand-line tool</a></li>
</ul>
</li>
</ul>
<h2><a name="shortIntro">1. Overview</a></h2>
<h3><a name="overview">1.1 What is BibMatch?</a></h3>
<p>BibMatch is a tool for matching bibliographic meta-data records against a local or remote Invenio repository.
Each incoming record may match zero, one or more than one existing record. This makes it possible
to identify potential duplicate entries before they are uploaded into the repository.
It can also help to detect duplicates that already exist in the database.</p>
<p>BibMatch also acts as a filter for incoming records, splitting them into 'new records' and 'existing records'.
This separation matters when ingesting content into a digital repository, since new records are typically inserted
while existing ones are updated or replaced.</p>
<h3><a name="features">1.2 Features</a></h3>
<ul>
<li>Matches meta-data records locally and remotely</li>
<li>Supports user authentication to allow matching against restricted collections</li>
<li>Highly configurable match validation step for reliable matching results</li>
<li>Allows full customisation of the search queries used to find matching candidates</li>
<li>Incoming meta-data can be manipulated through BibConvert formatting functions</li>
<li>Supports transliteration of Unicode meta-data into ASCII (useful for legacy systems)</li>
</ul>
<h2><a name="usage">2. Usage</a></h2>
<h3><a name="basic">2.1 Basic usage</a></h3>
<h4>Input records</h4>
<p>BibMatch needs a set of records to match. You can give BibMatch these records in two ways:</p>
<p>By standard input:</p>
<blockquote>
<pre>
$ bibmatch < input.xml
</pre>
</blockquote>
<p>or with the <em>-i</em> parameter:</p>
<blockquote>
<pre>
$ bibmatch -i input.xml
</pre>
</blockquote>
<h4>Output records</h4>
<p>When BibMatch matches records, each one is classified into one of four categories:</p>
<ul>
<li><strong>Match</strong> - exact match found</li>
<li><strong>Ambiguous</strong> - record matches more than one record</li>
<li><strong>Fuzzy</strong> - record <em>may</em> match a record</li>
<li><strong>New</strong> - record does not match any record</li>
</ul>
<p>You can choose which type of records to output once matching has completed by passing the corresponding option:</p>
<blockquote>
<pre>
$ bibmatch --print-match < input.xml
</pre>
</blockquote>
<blockquote>
<pre>
$ bibmatch --print-new < input.xml
</pre>
</blockquote>
<p>You can also write all matching results to a set of files by passing a filename prefix to the command-line option <em>-b</em>:</p>
<blockquote>
<pre>
$ bibmatch -b output_results < input.xml
</pre>
</blockquote>
<p>This will create a set of four files: <em>output_results.matched.xml</em>, <em>output_results.new.xml</em>, <em>output_results.fuzzy.xml</em> and <em>output_results.ambiguous.xml</em>.</p>
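<p>As a quick sanity check after a batch run, the four files can be inspected with a short script. The following is only a sketch in Python. It assumes that the output files are well-formed MARCXML whose records are <em>record</em> elements, and the helper name <em>count_records</em> is made up for this illustration; it is not part of BibMatch.</p>

```python
import os
import xml.etree.ElementTree as ET

# Suffixes of the files written by "bibmatch -b <prefix>", as listed above.
SUFFIXES = ("matched", "new", "fuzzy", "ambiguous")


def count_records(prefix):
    """Return {suffix: number of <record> elements} for one batch run.

    Hypothetical helper, not part of BibMatch. Missing or empty files
    are counted as containing zero records.
    """
    counts = {}
    for suffix in SUFFIXES:
        path = "%s.%s.xml" % (prefix, suffix)
        if not os.path.exists(path) or os.path.getsize(path) == 0:
            counts[suffix] = 0
            continue
        root = ET.parse(path).getroot()
        # Compare local names so that namespaced MARCXML is handled too.
        counts[suffix] = sum(
            1 for element in root.iter()
            if element.tag.rsplit("}", 1)[-1] == "record"
        )
    return counts


if __name__ == "__main__":
    for suffix, total in sorted(count_records("output_results").items()):
        print("%-10s %d" % (suffix, total))
```

<p>Comparing the four totals with the number of input records is a simple way to confirm that every record ended up in exactly one file.</p>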
<h4>Matching queries</h4>
<p>By default, BibMatch will try to find potential record matches using the MARC tag 245__$$a, i.e. the title. In many cases the title alone is not
a reliable way to find all potential matching records, because titles can be ambiguous. BibMatch therefore provides a way to specify the exact queries to use
and where to extract the meta-data to put in the queries.</p>
<p>Using the command-line option <em>-q</em> you can specify your own <em>querystrings</em> to use when searching for records.
Read more about querystrings <a href="#querystrings">here</a>.</p>
<p>For example, if available in the meta-data, you can search using the ISBN or DOI, which <em>usually</em> are stable identifiers:</p>
<blockquote>
<pre>
$ bibmatch -q "[020__a] or [0247_a]" -b output_results < input.xml
</pre>
</blockquote>
<p>As you can see, any data from the input record that you want to substitute into the query is referenced using square brackets [] containing the exact MARC notation.</p>
<p>You can specify several <em>-q</em> queries and they will be performed <u>in order</u> until a match is found. If you want to avoid
specifying long, complicated querystrings every time, you can use short-hand <em>template queries</em> that can be defined in the configuration of BibMatch (see the hacking guide).</p>
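<p>The bracket substitution and the in-order evaluation described above can be pictured with a small Python sketch. This is an illustration under stated assumptions, not BibMatch's actual code: the record is assumed to be a plain dictionary mapping a MARC tag to a list of values, the <em>search</em> callable stands in for a repository search, and the names <em>fill_query</em> and <em>match_in_order</em> are hypothetical.</p>

```python
import re

# Matches a simple [tag] placeholder such as [020__a] or [245__a].
PLACEHOLDER = re.compile(r"\[([^\[\]]+)\]")


def fill_query(querystring, record):
    """Replace each [tag] with the first value stored under that tag.

    Hypothetical helper: `record` is assumed to be a dict that maps a
    MARC tag to a list of strings. A missing tag becomes an empty string.
    """
    def substitute(match):
        values = record.get(match.group(1), [])
        return values[0] if values else ""
    return PLACEHOLDER.sub(substitute, querystring)


def match_in_order(querystrings, record, search):
    """Try each querystring in order and stop at the first one with hits.

    `search` is any callable that takes a query and returns a list of
    record identifiers; it stands in for the real repository search.
    """
    for querystring in querystrings:
        hits = search(fill_query(querystring, record))
        if hits:
            return querystring, hits
    return None, []
```

<p>With several querystrings, a strict identifier query can be listed first and a looser title query last, so the looser query is only reached when the stricter one finds nothing.</p>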
<h4>Match remote installations</h4>
<p>By default, BibMatch will try to match records on the local installation. In order to match against records on a remote Invenio installation
(like <a href="http://cds.cern.ch" target=_blank>http://cds.cern.ch</a> or <a href="http://inspire-hep.net" target=_blank>http://inspire-hep.net</a>)
you can use the option <em>-r</em>:</p>
<blockquote>
<pre>
$ bibmatch -r "http://cds.cern.ch" -q "[020__a] or [0247_a]" -b output_results < input.xml
</pre>
</blockquote>
<p>If the remote installation requires user authentication, you can also specify this using <em>--user</em>.
You will then be prompted for a password:</p>
<blockquote>
<pre>
$ bibmatch -r "http://cds.cern.ch" --user admin -q "[020__a] or [0247_a]" -b output_results < input.xml
</pre>
</blockquote>
<h4>Using TEXTMARC</h4>
<p>BibMatch usually works with MARCXML, but you can still input TEXTMARC files that will be automatically converted
to MARCXML in BibMatch. If you want to also receive the output in TEXTMARC (limited support) you can use the
command-line option <em>-t</em>:</p>
<blockquote>
<pre>
$ bibmatch -b out.marc -t < input.marc
</pre>
</blockquote>
<h3><a name="advanced">2.2 Advanced usage</a></h3>
<p>When matching records against a repository you <em>may</em> sometimes want to update or replace existing records
in the database with the given records. To allow for easy upload after BibMatch has run, you can use the
<em>-a</em> parameter to tell BibMatch to add the matched record's 001 identifier to the output.</p>
<blockquote>
<pre>
$ bibmatch -q "[020__a] or [0247_a]" -b output_records -a < input.xml
</pre>
</blockquote>
<p>You can also use BibConvert formats to manipulate MARC field values from the input records before matching.
See <a href="<CFG_SITE_URL>/help/admin/bibconvert-admin-guide#C.3.4<lang:link/>">BibConvert Admin Guide</a>
for more details on formats. An example of extracting the first word from the 100__a field and lower-casing it:
</p>
<blockquote>
<pre>
$ bibmatch --query-string="[245__a] [100__a::WORDS(1,R)::DOWN()]" < input.xml > output.xml
</pre>
</blockquote>
<p>You can also search directly in specific collection(s) in the repository using the <em>--collection</em> option.
To match against more than one collection, separate the collection names with commas:</p>
<blockquote>
<pre>
$ bibmatch --collection 'Books,Articles' < input.xml
</pre>
</blockquote>
<p>If some collections are restricted or you are searching for restricted meta-data fields in the repository, you
can specify a user login with the <em>--user</em> option:</p>
<blockquote>
<pre>
$ bibmatch --collection 'Theses' --user admin < input.xml
</pre>
</blockquote>
<h2><a name="examples">3. More examples</a></h2>
<h3><a name="more">3.1 Some more examples</a></h3>
<p>To match records by title against the title index and print only the new (unmatched) ones:</p>
<blockquote>
<pre>
$ bibmatch --print-new -q "[title]" --field="title" < input.xml > output.xml
</pre>
</blockquote>
<p>To print potential duplicate entries before manual upload, using a predefined query:</p>
<blockquote>
<pre>
$ bibmatch --print-match -q title-author < input.xml > output.xml
</pre>
</blockquote>
<p>Two ways of matching on multiple fields, including predefined fields (title, author etc.):</p>
<blockquote>
<pre>
$ bibmatch --query-string="[245__a] [author]" < input.xml > output.xml
$ bibmatch --query-string="245__a||author" < input.xml > output.xml
</pre>
</blockquote>
<p>To print "fuzzy" (almost matching by title) records:</p>
<blockquote>
<pre>
$ bibmatch --print-fuzzy < input.xml > output.xml
</pre>
</blockquote>
<h2><a name="usage">Appendix</a></h2>
<h3><a name="querystrings">A. Querystrings</a></h3>
<p>Querystrings determine which type of query/strategy to use when searching for the matching records in the database.</p>
<p><strong>Predefined querystrings:</strong></p>
<p>There are some predefined querystrings available:</p>
<ul>
<li><strong>title</strong> - standard title search (e.g. "this is a title"). This is the default.</li>
<li><strong>title-author</strong> - title and author search (e.g. "this is a title AND Lastname, F")</li>
<li><strong>reportnumber</strong> - report number search (e.g. reportnumber:REP-NO-123)</li>
</ul>
<p>You can also add your own predefined querystrings in the invenio.conf file.</p>
<p>You can structure your query in different ways:</p>
<ul>
<li>
Old-style: fieldnames separated by '||' (conforms with earlier BibMatch versions):
<pre>
-q "773__p||100__a"
</pre>
</li>
<li>
New-style: Invenio query syntax with "bracket syntax":
<pre>
-q "773__p:\"[773__p]\" 100__a:[100__a]"
</pre>
</li>
</ul>
<p>Depending on the structure of the query, BibMatch fetches the associated values from each record and puts them into
the final search query. In the example above, for instance, it inserts the journal title from 773__p.</p>
<p>When more than one value/datafield is found, e.g. when looking for 700__a (additional authors),
several queries will be put together to make sure all combinations of values are accounted for.
The queries are joined with the given operator (-o, --operator) value.</p>
<p>Note: You can add more than one query to a search by giving more (-q, --query-string) arguments.
The results of all queries will be combined when matching.</p>
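<p>The way repeated field values multiply into several sub-queries can be sketched as follows. This Python fragment is only an illustration: the exact query text that BibMatch builds is not documented here, so the <em>tag:"value"</em> form, the parentheses and the function name <em>expand_query</em> are assumptions. Only the operator codes "a" (AND) and "o" (OR) are taken from the command-line help.</p>

```python
from itertools import product


def expand_query(field_values, operator="a"):
    """Build one sub-query per combination of values and join them.

    Hypothetical helper: `field_values` maps a MARC tag to the list of
    values found for it in the input record. `operator` uses the same
    codes as the -o option: "a" for AND, "o" for OR.
    """
    joiner = {"a": " AND ", "o": " OR "}[operator]
    tags = list(field_values)
    # Every combination picks exactly one value for each tag.
    combinations = product(*(field_values[tag] for tag in tags))
    subqueries = [
        " ".join('%s:"%s"' % (tag, value) for tag, value in zip(tags, combo))
        for combo in combinations
    ]
    return joiner.join("(%s)" % subquery for subquery in subqueries)
```

<p>A record with one title and two additional authors yields two sub-queries, one per author, joined by the chosen operator.</p>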
<h3><a name="bibconvert">B. BibConvert formats</a></h3>
<p>Another option to further improve your matching strategy is to use BibConvert formats. By using the formats
provided by BibConvert you can change the values from the retrieved record fields.</p>
<p>For example, using WORDS(1,R) will only return the first (1) word from the right (R). This can be very useful when
adjusting your matching parameters to better match the content, for example to get only an author's last name
instead of the full name.</p>
<p>You can use these formats directly in the querystrings (indicated by '::'):</p>
<ul>
<li>Old-style: -q "100__a::WORDS(1,R)::DOWN()"<br />
This query will take the first word from the right from 100__a and also convert it to lower-case.
</li>
<li>New-style: -q "100__a:[100__a::WORDS(1,R)::DOWN()]"<br />
See the BibConvert documentation for a more detailed explanation of formats.
</li>
</ul>
<h3><a name="predefined">C. Predefined fields</a></h3>
<p>In addition to specifying distinct MARC fields in the querystrings you can use predefined
fields as configured in the LOCAL(!) Invenio system. These fields will then be mapped to one
or more fieldtags to be retrieved from input records.</p>
<p>Common predefined fields used in querystrings (for the Invenio demo site; your fields may vary!):</p>
<pre>'abstract', 'affiliation', 'anyfield', 'author', 'coden', 'collaboration',
'collection', 'datecreated', 'datemodified', 'division', 'exactauthor', 'exactfirstauthor',
'experiment', 'fulltext', 'isbn', 'issn', 'journal', 'keyword', 'recid',
'reference', 'reportnumber', 'subject', 'title', 'year'</pre>
<h3><a name="cli">D. BibMatch command-line tool</a></h3>
<blockquote>
<pre>
 Output:

 -0 --print-new (default)   print unmatched in stdout
 -1 --print-match           print matched records in stdout
 -2 --print-ambiguous       print records that match more than 1 existing records
 -3 --print-fuzzy           print records that match the longest words in existing records

 -b --batch-output=(filename). filename.new will be new records, filename.matched will be matched,
                            filename.ambiguous will be ambiguous, filename.fuzzy will be fuzzy match
 -t --text-marc-output      transform the output to text-marc format instead of the default MARCXML

 Simple query:

 -q --query-string=(search-query/predefined-query) See "Querystring"-section below.
 -f --field=(field)

 General options:

 -n   --noprocess           Do not print records in stdout.
 -i,  --input               use a named file instead of stdin for input
 -v,  --verbose=LEVEL       verbose level (from 0 to 9, default 1)
 -r,  --remote=URL          match against a remote Invenio installation (Full URL, no trailing '/')
                            Beware: Only searches public records attached to home collection
 -a,  --alter-recid         The recid (controlfield 001) of matched or fuzzy matched records in
                            output will be replaced by the 001 value of the matched record.
                            Note: Useful if you want to replace matched records using BibUpload.
 -z,  --clean               clean queries before searching
      --no-validation       do not perform post-match validation
 -h,  --help                print this help and exit
 -V,  --version             print version information and exit

 Advanced options:

 -m --mode=(a|e|o|p|r)      perform an advanced search using special search mode.
                              Where mode is:
                                "a" all of the words,
                                "o" any of the words,
                                "e" exact phrase,
                                "p" partial phrase,
                                "r" regular expression.

 -o --operator(a|o)         used to concatenate identical fields in search query (i.e. several report-numbers)
                              Where operator is:
                                "a" boolean AND (default)
                                "o" boolean OR

 -c --config=filename       load querystrings from a config file. Each line starting with QRYSTR will
                            be added as a query. i.e. QRYSTR --- [title] [author]

 -x --collection            only perform queries in certain collection(s).
                            Note: matching against restricted collections requires authentication.

 --user=USERNAME            username to use when connecting to Invenio instance. Useful when searching
                            restricted collections. You will be prompted for password.
</pre>
</blockquote>
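<p>The <em>--config</em> file format mentioned in the help text above can be illustrated with a short Python sketch. It is not BibMatch's own parser: it assumes, based only on the single example in the help text, that each query line starts with QRYSTR and that the query follows a "---" separator. The function name <em>read_querystrings</em> is hypothetical.</p>

```python
def read_querystrings(lines):
    """Collect the queries from lines of the form "QRYSTR --- <query>".

    Hypothetical helper. Lines that do not start with QRYSTR, or that
    have no query after the "---" separator, are ignored.
    """
    queries = []
    for line in lines:
        line = line.strip()
        if not line.startswith("QRYSTR"):
            continue
        # Everything after the first "---" is taken as the querystring.
        _, _, query = line.partition("---")
        query = query.strip()
        if query:
            queries.append(query)
    return queries
```

<p>Called on the lines of a config file, this returns the querystrings in file order, which is the order in which they would be passed as <em>-q</em> arguments.</p>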
