BibIndex: fuzzy author name tokenizer
012d0e547d2b

Authored by Joe Blaylock <jrbl@slac.stanford.edu> on Mar 18 2010, 03:16.

Description

BibIndex: fuzzy author name tokenizer

Introduced fuzzy author name tokenizer.
- Get a tokenizer with b_e_t.BibIndexFuzzyNameTokenizer().
- Call tokenizer.tokenize(name) to get a (potentially long) list of expanded forms, suitable for phrase-indexing. Strings in, lists of strings out.
- Or call tokenizer.scan(name) to turn a name into the data structure consumed by tokenizer.parse_scanned: a dictionary that tags non-lastnames, lastnames, and titles.
- You can also call tokenizer.parse_scanned() on such tagged data to generate the expanded forms directly.
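
As a rough illustration of how the three calls relate, here is a minimal, self-contained sketch. It is not the code from this commit: ToyFuzzyNameTokenizer, its tiny title list, the dictionary keys, and the expansion rules are all assumptions made for the example. Only the method names (tokenize, scan, parse_scanned) and the "strings in, lists of strings out" contract come from the description above.

```python
class ToyFuzzyNameTokenizer:
    """Hypothetical stand-in for illustration; NOT BibIndexFuzzyNameTokenizer."""

    # Assumed, deliberately tiny list of titles (compared lowercase, no dots).
    TITLES = {"dr", "prof", "jr", "sr"}

    def _drop_titles(self, words, titles):
        """Return words without titles; append any titles found to `titles`."""
        kept = []
        for word in words:
            if word.strip(".").lower() in self.TITLES:
                titles.append(word)
            else:
                kept.append(word)
        return kept

    def scan(self, name):
        """Tag the parts of a name as lastnames, nonlastnames, or titles."""
        titles = []
        if "," in name:
            # Assumed convention: "Last, First Middle".
            last_part, _, rest = name.partition(",")
            lastnames = self._drop_titles(last_part.split(), titles)
            nonlastnames = self._drop_titles(rest.replace(",", " ").split(), titles)
        else:
            # Assumed convention: the final non-title word is the last name.
            words = self._drop_titles(name.split(), titles)
            lastnames, nonlastnames = words[-1:], words[:-1]
        return {"lastnames": lastnames, "nonlastnames": nonlastnames, "titles": titles}

    def parse_scanned(self, tagged):
        """Expand tagged data into name variants (titles are ignored here)."""
        last = " ".join(tagged["lastnames"])
        firsts = tagged["nonlastnames"]
        if not firsts:
            return [last] if last else []
        full = " ".join(firsts)
        initials = " ".join(word[0] + "." for word in firsts)
        forms = []
        for given in (full, initials):
            for form in (last + ", " + given, given + " " + last):
                if form not in forms:  # keep order, skip duplicates
                    forms.append(form)
        return forms

    def tokenize(self, name):
        """String in, list of strings out: scan, then expand."""
        return self.parse_scanned(self.scan(name))


tokenizer = ToyFuzzyNameTokenizer()
print(tokenizer.scan("Dr. John R. Ellis"))
print(tokenizer.tokenize("Ellis, John"))
```

In this sketch tokenize is simply scan followed by parse_scanned, which is why tagged data obtained elsewhere can be fed to parse_scanned directly. The real tokenizer presumably produces many more variants per name.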

Includes unit tests for all of the above, covering roughly 98% of the common cases. Pathological names are tested as well.

(closes: #14366, #64426, #14513)