Script to detect and show invalid UTF-8 sequences in documents.
Summary:
'arc' is now less forgiving about playing fast and loose with encodings, so
make it easier to find and fix encoding problems in a codebase with preexisting
issues. See D431.
This script takes a list of files and either does a simple pass/fail test for
UTF-8 validity with the "-t" flag, or, without it, shows you exactly what
problems exist. The idea is that if you have an existing codebase and you want
to fix all the UTF-8 problems, you can do something like this:
    find . -type f -name '*.php' -print0 | xargs -0 -n256 ./utf8.php -t
That will give you a list of all the files with problems. Now you can inspect
them individually or in groups and fix the issues:
    ./utf8.php file1 file2 ...
You can also find problems in a diff, or some other piece of command output:
    git diff -U99999 HEAD | ./utf8.php -
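For readers who want to see the two modes side by side, here is a minimal
Python sketch of a pass/fail check (similar in spirit to "-t") and a detailed
report of where the invalid bytes are. This is not the code in utf8.php and
does not reproduce its output format; the function names are hypothetical and
the approach (relying on the decoder's error offsets) is an assumption made
for illustration.

```python
def find_invalid_utf8(data: bytes):
    """Return a list of (byte_offset, offending_bytes) for invalid sequences.

    Hypothetical helper: decodes repeatedly and uses the offsets carried by
    UnicodeDecodeError to locate each bad span.
    """
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder is valid
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice we decoded
            start = pos + err.start
            end = pos + err.end
            problems.append((start, data[start:end]))
            pos = end  # resume scanning after the bad span
    return problems


def is_valid_utf8(data: bytes) -> bool:
    """Pass/fail check: True when the data contains no invalid sequences."""
    return not find_invalid_utf8(data)
```

Calling is_valid_utf8() on a file's raw bytes gives the pass/fail answer, while
find_invalid_utf8() gives the offsets to inspect. Re-slicing on every error
keeps the sketch short but is slow on large inputs with many errors.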
Test Plan:
Ran this script on various valid and invalid files.
Reviewed By: jungejason
Reviewers: edward, aran, jungejason, tuomaspelkonen
CC: aran, jungejason
Differential Revision: 433