Script to detect and show invalid UTF-8 sequences in documents.

Authored by epriestley <git@epriestley.com> on Jun 11 2011, 20:31.

Description

Summary:
'arc' is now less forgiving about playing it fast-and-loose with encodings, so
make it easier to find and fix these problems in a codebase with preexisting
issues. See D431.

This script takes a list of files. With the "-t" flag it runs a simple pass/fail
test for UTF-8 validity on each file; without the flag it shows you exactly what
problems exist. If you have an existing codebase and want to fix all of its
UTF-8 problems, you can do something like this:

find . -type f -name '*.php' -print0 | xargs -0 -n256 ./utf8.php -t

That will give you a list of all the files with problems. Now you can inspect
them individually or in groups and fix the issues:

./utf8.php file1 file2 ...

You can also find problems in a diff, or some other piece of command output:

git diff -U99999 HEAD | ./utf8.php -
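
For readers without the script at hand, the pass/fail listing step can be
approximated with standard tools. This is a minimal sketch, not how utf8.php
works: it assumes an iconv binary is installed, and the function name
list_invalid_utf8 is made up for illustration. Converting a file from UTF-8 to
UTF-8 fails when iconv meets an invalid sequence, so the exit status serves as
the validity test.

```shell
#!/bin/sh
# Hypothetical stand-in for the "-t" workflow; NOT the utf8.php implementation.
# Prints the name of every argument that iconv cannot read as valid UTF-8.
# Note: files that are missing or unreadable are reported as well, because
# iconv also exits non-zero for them.
list_invalid_utf8() {
  for f in "$@"; do
    # UTF-8 -> UTF-8 is a no-op for valid input and an error otherwise.
    if ! iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1; then
      printf '%s\n' "$f"
    fi
  done
}
```

Calling list_invalid_utf8 file1 file2 ... prints one offending file name per
line. Unlike the script run without "-t", it does not show where in the file
the problem is.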

Test Plan:
Ran this script on various valid and invalid files.

Reviewed By: jungejason
Reviewers: edward, aran, jungejason, tuomaspelkonen
CC: aran, jungejason
Differential Revision: 433

Details

Committed
epriestley <git@epriestley.com> on Jun 13 2011, 22:48
Pushed
aubort on Mar 17 2017, 12:03
Parents
rPHU4aec97acc9c7: Test Conduit connection before daemonizing if --daemonize and --conduit-uri are…
Branches
Unknown
Tags
Unknown

Event Timeline

epriestley <git@epriestley.com> committed rPHU96b5320b9ef2: Script to detect and show invalid UTF-8 sequences in documents. (authored by epriestley <git@epriestley.com>) on Jun 13 2011, 22:48