Use unicode mode when tokenizing strings like user realnames

Authored by epriestley <git@epriestley.com> on Nov 8 2015, 14:36.

Description

Summary:
Fixes T9732. We currently tokenize strings (like user realnames) in the default non-unicode mode, which can cause patterns like \s to work incorrectly.

Use /u to use unicode-aware tokenization instead.

Test Plan:
The behavior of "\s" depends upon environmental settings like LC_ALL.

With LC_ALL set to "C", the byte \xA0 is not considered a whitespace character.
With LC_ALL set to "en_US", it is. That byte is also the last byte of the UTF-8 encoding of "忠" (\xE5\xBF\xA0):

$ php -r 'setlocale(LC_ALL, "C"); echo count(preg_split("/\s/", "\xE5\xBF\xA0")) . "\n";'
1
$ php -r 'setlocale(LC_ALL, "en_US"); echo count(preg_split("/\s/", "\xE5\xBF\xA0")) . "\n";'
2

To reproduce the original issue, I added an explicit:

setlocale(LC_ALL, "en_US");

...call before the preg_split() call. This caused "忠" to be improperly split.

I then added "/u", and observed proper tokenization.
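
As a rough sketch of the difference, the snippet below splits a name on whitespace with and without the "/u" modifier. The helper tokenize_on_whitespace() and the sample name are hypothetical and exist only for illustration; this is not the tokenizer Phabricator actually uses.

```php
<?php

// Hypothetical helper for illustration only; not Phabricator's tokenizer.
function tokenize_on_whitespace(string $string, bool $unicode): array {
  // With "/u", PCRE treats the subject as UTF-8, so "\s" is matched against
  // whole code points instead of individual bytes such as \xA0.
  $pattern = $unicode ? '/\s+/u' : '/\s+/';

  $tokens = preg_split($pattern, $string, -1, PREG_SPLIT_NO_EMPTY);

  // preg_split() returns false on failure, e.g. if the subject is not
  // valid UTF-8 while "/u" is in effect.
  return ($tokens === false) ? array() : $tokens;
}

// "忠 Alice": the first character is encoded as the bytes \xE5\xBF\xA0.
$name = "\xE5\xBF\xA0 Alice";

// In unicode mode the trailing \xA0 byte is never examined on its own,
// so the name splits only at the real space.
$tokens = tokenize_on_whitespace($name, true);
echo count($tokens)."\n";
```

Passing false as the second argument selects the byte-oriented pattern, whose result for this input depends on the active locale, as described above.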

Reviewers: chad

Reviewed By: chad

Subscribers: qiu8310

Maniphest Tasks: T9732

Differential Revision: https://secure.phabricator.com/D14441

Details

Committed
epriestley <git@epriestley.com> Nov 8 2015, 16:03
Pushed
aubort Jan 31 2017, 17:16
Parents
rPH37df419266d4: Add Can Create Policy Capability to Phame Blogs
Branches
Unknown
Tags
Unknown

Event Timeline

epriestley <git@epriestley.com> committed rPH152ddf57092e: Use unicode mode when tokenizing strings like user realnames (authored by epriestley <git@epriestley.com>). Nov 8 2015, 16:03