Don't let stemming reduce a word beneath 3 characters

Authored by epriestley <git@epriestley.com> on Dec 6 2016, 17:26.

Description

Summary:
Ref T11922. Porter stems "DNS" (an acronym for "Domain Name Syrup") into "dn", which is meaningless and too short to index.

Don't let stemming make an indexable token un-indexable by shortening it: if the stem is too short, just return the normalized input.
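The guard described above can be sketched in a few lines. This is a minimal illustration, not the code from this commit: `toy_stem` is a hypothetical stand-in for a real Porter stemmer (it only strips one trailing "s"), and `MIN_TOKEN_LENGTH`, `normalize`, and `stem_token` are names assumed for the example.

```python
# Assumption: the index ignores tokens shorter than 3 characters.
MIN_TOKEN_LENGTH = 3


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a Porter stemmer: strips one trailing "s"."""
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(token: str) -> str:
    """Trim surrounding whitespace and lowercase the token."""
    return token.strip().lower()


def stem_token(token: str, stemmer=toy_stem) -> str:
    """Stem a token, but never shorten it below the indexable length."""
    normalized = normalize(token)
    stem = stemmer(normalized)
    # If stemming made the token too short to index, fall back to the
    # normalized input instead of the stem.
    if len(stem) < MIN_TOKEN_LENGTH:
        return normalized
    return stem
```

With this sketch, `stem_token("DNS")` yields `"dns"` rather than `"dn"`, while a longer word such as `"Tasks"` is still stemmed to `"task"`.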

(I believe there are very few legitimate English words that have two-letter roots, anyway.)

Test Plan: Added unit tests.

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T11922

Differential Revision: https://secure.phabricator.com/D17001

Details

Committed
epriestley <git@epriestley.com>, Dec 6 2016, 18:10
Pushed
aubort, Mar 17 2017, 12:03
Parents
rPHU7009bcd3fb9b: Give PhutilClassMapQuery a public cache key
Branches
Unknown
Tags
Unknown

Event Timeline

epriestley <git@epriestley.com> committed rPHU5ac2ca121489: Don't let stemming reduce a word beneath 3 characters (authored by epriestley <git@epriestley.com>). Dec 6 2016, 18:10