Page Menu
Home
c4science
Search
Configure Global Search
Log In
Files
F91121850
utf8.diviner
No One
Temporary
Actions
Download File
Edit File
Delete File
View Transforms
Subscribe
Mute Notifications
Award Token
Subscribers
None
File Metadata
Details
File Info
Storage
Attached
Created
Fri, Nov 8, 03:29
Size
2 KB
Mime Type
text/x-c
Expires
Sun, Nov 10, 03:29 (1 d, 23 h)
Engine
blob
Format
Raw Data
Handle
22200770
Attached To
rPH Phabricator
utf8.diviner
View Options
@title
User Guide: UTF-8 and Character Encoding
@group
userguide
How Phabricator handles character encodings.
= Overview =
Phabricator stores all internal text data as UTF-8, processes all text data
as UTF-8, outputs in UTF-8, and expects all inputs to be UTF-8. Principally,
this means that you should write your source code in UTF-8. In most cases this
does not require you to change anything, because ASCII text is a subset of
UTF-8.
= Detecting and Repairing Files =
It is recommended that you write source files only in ASCII text, but
Phabricator fully supports UTF-8 source files. However, it won't currently do
encoding transformation, so if you have source files which are not valid UTF-8
you may run into issues.
If you have a project which isn't valid UTF-8 because a few files have random
binary nonsense in them, there is a script in libphutil which can help you
identify and fix them:
project/ $ libphutil/scripts/utils/utf8.php
Generally, run this script on all source files with "-t" to find files with bad
byte ranges, and then run it without "-t" on each file to identify where there
are problems. For example:
project/ $ find . -type f -name '*.c' -print0 | xargs -0 -n256 ./utf8 -t
./hello_world.c
If this script exits without output, you're in good shape and all the files that
were identified are valid UTF-8. If it found some problems, you need to repair
them. You can identify the specific problems by omitting the "-t" flag:
project/ $ ./utf8.php hello_world.c
FAIL hello_world.c
3 main()
4 {
5 printf ("Hello World<0xE9><0xD6>!\n");
6 }
7
This shows the offending bytes on line 5 (in the actual console display, they'll
be highlighted). Often a codebase will mostly be valid UTF-8 but have a few
scattered files that have other things in them, like curly quotes which someone
copy-pasted from Word into a comment. In these cases, you can just manually
identify and fix the problems pretty easily.
If you have a prohibitively large number of UTF-8 issues in your source code,
Phabricator doesn't include any default tools to help you process them in a
systematic way. You could hack up
##utf8.php##
as a starting point, or use other
tools to batch-process your source files.
NOTE: If you have a project which uses a
//different encoding//
for source
files, there is no easy way to get it working with Phabricator or Arcanist right
now. If it's not reasonable to switch to UTF-8, tell us more about your use case
and we can evaluate supporting it. Since tools like Git don't work well with
other encodings, the prevailing assumption is that this is a rare situation.
Event Timeline
Log In to Comment