Sanitize UTF8 more aggressively to satisfy json_encode()

Authored by epriestley <git@epriestley.com> on Aug 24 2016, 16:52.

Description

Summary:
Fixes T11525. Currently, there are some strings such that:

json_encode(phutil_utf8ize($string));

...fails. I encountered this with DarkConsole trying to JSON encode queries that inserted encrypted file data into the MySQL blob store, so basically random data.

There appear to be two cases we aren't handling well:

  • Overlong representations: A character that fits in fewer bytes can also be written, invalidly, as a longer byte sequence. We previously allowed these -- sometimes -- but json_encode() does not. Instead, reject them. We already rejected overlong 2-byte sequences.
  • Surrogate characters: The code points U+D800 through U+DFFF are reserved as surrogates for UTF-16, and json_encode() rejects them. Just reject these ourselves, too.
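
To make the two cases concrete, here is a minimal sketch of a strict validity check. It is not the libphutil implementation; example_is_strict_utf8() is a hypothetical helper written only for illustration. It decodes each multi-byte sequence and rejects it when the code point would have fit in fewer bytes (overlong) or falls in the surrogate range.

```php
<?php

/**
 * Hypothetical helper, not libphutil code: returns true only if $bytes is
 * UTF-8 with no overlong forms and no UTF-16 surrogate code points.
 */
function example_is_strict_utf8($bytes) {
  $len = strlen($bytes);
  $i = 0;
  while ($i < $len) {
    $b0 = ord($bytes[$i]);
    if ($b0 < 0x80) {
      // Plain ASCII byte.
      $i++;
      continue;
    }

    // Pick the number of continuation bytes and the smallest code point
    // that legitimately needs a sequence of this length.
    if (($b0 & 0xE0) === 0xC0) {
      $need = 1; $cp = $b0 & 0x1F; $min = 0x80;
    } else if (($b0 & 0xF0) === 0xE0) {
      $need = 2; $cp = $b0 & 0x0F; $min = 0x800;
    } else if (($b0 & 0xF8) === 0xF0) {
      $need = 3; $cp = $b0 & 0x07; $min = 0x10000;
    } else {
      // Stray continuation byte or invalid lead byte.
      return false;
    }

    if ($i + $need >= $len) {
      // Sequence is truncated.
      return false;
    }

    for ($k = 1; $k <= $need; $k++) {
      $b = ord($bytes[$i + $k]);
      if (($b & 0xC0) !== 0x80) {
        return false;
      }
      $cp = ($cp << 6) | ($b & 0x3F);
    }

    if ($cp < $min) {
      // Overlong: the code point fits in a shorter sequence.
      return false;
    }
    if ($cp >= 0xD800 && $cp <= 0xDFFF) {
      // Surrogate range reserved for UTF-16.
      return false;
    }
    if ($cp > 0x10FFFF) {
      // Beyond the Unicode range.
      return false;
    }

    $i += $need + 1;
  }
  return true;
}
```

For example, "\xC0\x80" (an overlong NUL), "\xE0\x80\xAF" (an overlong "/") and "\xED\xA0\x80" (U+D800) are all rejected, while "\xC3\xA9" (U+00E9) is accepted.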

Test Plan:
Wrote a bunch of test cases to cover this stuff, all of which now pass.

Fuzzed json_encode(phutil_utf8ize($string)) on random strings in a loop. Before these changes it would fail after a handful of attempts, in less than a second. After these changes, I ran it for several minutes and didn't see any failures.
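
The fuzzing loop was roughly of this shape. This is a sketch under assumptions, not the exact script that was run: example_fuzz_sanitizer() and its parameters are hypothetical, and the sanitizer is passed in as a callable so the snippet does not depend on libphutil being loaded.

```php
<?php

/**
 * Hypothetical fuzz harness: feeds random byte strings through $sanitize
 * and returns the first input whose sanitized form json_encode() rejects,
 * or null if every attempt encodes cleanly.
 */
function example_fuzz_sanitizer(callable $sanitize, $attempts, $max_len = 64) {
  for ($n = 0; $n < $attempts; $n++) {
    // Build a string of random bytes, so it is almost never valid UTF-8.
    $len = mt_rand(1, $max_len);
    $input = '';
    for ($i = 0; $i < $len; $i++) {
      $input .= chr(mt_rand(0, 255));
    }

    // json_encode() returns false when it cannot encode the value.
    if (json_encode($sanitize($input)) === false) {
      return $input;
    }
  }
  return null;
}
```

With libphutil loaded, a call such as example_fuzz_sanitizer('phutil_utf8ize', 100000) would return a failing input if one turns up, which can then be added as a test case.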

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T11525

Differential Revision: https://secure.phabricator.com/D16440

Details

Committed
epriestley <git@epriestley.com> on Aug 24 2016, 18:31
Pushed
aubort on Mar 17 2017, 12:03
Parents
rPHU237549280f08: Improve parsing of docblock specials
Branches
Unknown
Tags
Unknown

Event Timeline

epriestley <git@epriestley.com> committed rPHU5fd1af8b4f2b: Sanitize UTF8 more aggressively to satisfy json_encode() (authored by epriestley <git@epriestley.com>) on Aug 24 2016, 18:31.