BibIndex: fuzzy author name tokenizer

Authored by Joe Blaylock <jrbl@slac.stanford.edu> on Mar 18 2010, 03:16.

Description

  • Introduced a fuzzy author name tokenizer.
    • Get a tokenizer with b_e_t.BibIndexFuzzyNameTokenizer().
    • Call tokenizer.tokenize(name) to get a (potentially long) list of expanded forms of the name, suitable for phrase indexing. It takes a string and returns a list of strings.
    • Alternatively, call tokenizer.scan(name) to turn the name into the idiosyncratic data structure consumed by tokenizer.parse_scanned: a dictionary that tags non-lastnames, lastnames, and titles.
    • Call tokenizer.parse_scanned with that tagged dictionary to generate the expanded forms directly from tagged data.
  • Includes unit tests for all of the above, covering roughly 98% of the common cases. Pathological names are tested as well.

    (closes: #14366, #64426, #14513)
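
The scan / parse_scanned / tokenize split described above can be illustrated with a small stand-alone sketch. This is not the BibIndex code: the class name SketchNameTokenizer, the dictionary keys ("titles", "lastnames", "nonlastnames"), the title list, and the expansion rules are all assumptions chosen only to show the general shape of such an API (string in, tagged dictionary in the middle, list of strings out).

```python
class SketchNameTokenizer:
    """Hypothetical stand-in for a fuzzy author name tokenizer.

    Not the BibIndex implementation: key names, the title list, and the
    expansion rules below are assumptions made for illustration.
    """

    TITLES = {"dr", "prof", "jr", "sr"}  # assumed, deliberately tiny

    def scan(self, name):
        """Tag the words of a name as titles, lastnames, or nonlastnames."""
        tagged = {"titles": [], "lastnames": [], "nonlastnames": []}

        def strip_titles(words):
            # Move recognised titles into tagged["titles"], keep the rest.
            kept = []
            for word in words:
                if word.rstrip(".").lower() in self.TITLES:
                    tagged["titles"].append(word)
                else:
                    kept.append(word)
            return kept

        if "," in name:
            # "Last, First Middle": everything before the comma is the last name.
            last_part, _, first_part = name.partition(",")
            tagged["lastnames"] = strip_titles(last_part.split())
            tagged["nonlastnames"] = strip_titles(first_part.split())
        else:
            # "First Middle Last": assume the final non-title word is the last name.
            words = strip_titles(name.split())
            tagged["lastnames"] = words[-1:]
            tagged["nonlastnames"] = words[:-1]
        return tagged

    def parse_scanned(self, tagged):
        """Expand tagged data into a de-duplicated list of name forms."""
        last = " ".join(tagged["lastnames"])
        given = tagged["nonlastnames"]
        if not last:
            return []
        forms = [last]
        if given:
            candidates = (
                " ".join(given),                       # all given names
                " ".join(w[0] + "." for w in given),   # all initials
                given[0],                              # first given name only
                given[0][0] + ".",                     # first initial only
            )
            for candidate in candidates:
                form = f"{last}, {candidate}"
                if form not in forms:
                    forms.append(form)
        return forms

    def tokenize(self, name):
        """String in, list of strings out."""
        return self.parse_scanned(self.scan(name))


tokenizer = SketchNameTokenizer()
print(tokenizer.scan("Dr. John Ellis"))
print(tokenizer.tokenize("Ellis, John R"))
```

Keeping scan and parse_scanned separate, as the commit does, lets a caller that already has tagged name parts skip the string-parsing step and go straight to expansion.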

Details

Committed
Tibor Simko <tibor.simko@cern.ch> on Mar 28 2010, 19:02
Parents
R3600:9130a1908bf0: dbexec: added interactive option

Event Timeline

Tibor Simko <tibor.simko@cern.ch> committed R3600:012d0e547d2b: BibIndex: fuzzy author name tokenizer (authored by Joe Blaylock <jrbl@slac.stanford.edu>) on Mar 28 2010, 19:02.