[tex-live] User names too longs or with diacritics on Windows
Jonathan Kew
jfkthame at googlemail.com
Sun Apr 14 16:42:40 CEST 2013
On 14/4/13 15:17, Khaled Hosny wrote:
> On Sun, Apr 14, 2013 at 02:59:07PM +0100, Philip TAYLOR wrote:
>> I wonder whether it might be worth bringing Jonathan Kew,
>> and/or the current maintainer(s) of XeTeX, into this
>> discussion; they must surely have a reasonable knowledge
>> of what additional steps are needed to ensure that
>> Windows utilities are Unicode-aware.
>
> The Windows port of XeTeX was done by Akira, so he is the expert.
>
Yes, as Khaled says. I've never even attempted to build xetex on
windows; Akira has always done that (for which I am -extremely- grateful).
Having said that, I can take a guess at the problem illustrated in your
example. XeTeX defaults to interpreting its input as utf-8 (unless it
detects utf-16, by "sniffing" the first couple of bytes). However, I
think the shell on windows is passing the command line to xetex using
the system default codepage (CP1252 in your case, probably). So the "é"
character in your filename, which in utf-8 would be encoded as the bytes
<0xC3 0xA9>, is not received in that form but as the single byte <0xE9>
("é" in CP1252). As far as xetex is concerned, that looks like the first
byte of a multi-byte utf-8 sequence, so it tries to interpret it as
such, and the result is "garbage".
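To make the byte-level picture concrete, here is a small Python sketch (not xetex code). It assumes the shell really does pass CP1252; the ".tex" suffix on the argument is made up for illustration.

```python
# The character from the example, encoded two ways.
name = "é"
utf8_bytes = name.encode("utf-8")      # what xetex expects by default
cp1252_bytes = name.encode("cp1252")   # what the shell probably passes

# 0xE9 is 0b11101001; the 1110xxxx pattern marks the lead byte of a
# three-byte utf-8 sequence, so a utf-8 decoder expects two continuation
# bytes (10xxxxxx) to follow.
lead = cp1252_bytes[0]
looks_like_three_byte_lead = (lead & 0xF0) == 0xE0

# Hypothetical argument as received: the CP1252 byte followed by ".tex".
received = cp1252_bytes + b".tex"

try:
    received.decode("utf-8")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "invalid utf-8"

# A lenient decoder substitutes U+FFFD for the byte it cannot interpret.
lenient_result = received.decode("utf-8", errors="replace")

print(utf8_bytes.hex(), cp1252_bytes.hex(), strict_result, repr(lenient_result))
```

How xetex itself handles the malformed sequence may differ from Python's decoder; the point is only that <0xE9> on its own is not valid utf-8.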
To fix this, xetex would need to ask the system what the current
codepage is, and convert the command line from that codepage to unicode
for its internal use.
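A rough sketch of that conversion step, in Python rather than in xetex's own code. The function name decode_command_line_arg and the filename are hypothetical; locale.getpreferredencoding() merely stands in for "ask the system what the current codepage is".

```python
import locale
from typing import Optional


def decode_command_line_arg(raw: bytes, codepage: Optional[str] = None) -> str:
    """Convert a raw command-line argument from the system codepage to unicode.

    If no codepage is given, ask the system for its preferred encoding.
    """
    if codepage is None:
        codepage = locale.getpreferredencoding(False)
    return raw.decode(codepage)


# Hypothetical argument as a CP1252 shell would pass it.
raw_arg = b"r\xe9sum\xe9.tex"
internal_name = decode_command_line_arg(raw_arg, "cp1252")

# Once decoded, the name can be re-encoded as utf-8 for internal use.
internal_utf8 = internal_name.encode("utf-8")
print(internal_name, internal_utf8)
```

The codepage is passed explicitly in the example so that the result does not depend on the machine running it.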
Moreover, for messages written to the terminal to appear correctly, it
would also need to convert those messages back from unicode to the
system codepage - or avoid the issue by the use of ^^xx escapes, so that
the terminal output is pure ascii.
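The escape approach could look something like the following Python sketch. It assumes the escapes are applied to each byte of the utf-8 form; whether xetex would escape bytes or whole characters is not something I have checked, and caret_escape is a made-up name.

```python
def caret_escape(text: str) -> str:
    """Return text with every non-printable-ascii byte written as ^^xx.

    The text is first encoded as utf-8; each byte outside the printable
    ascii range is replaced by two carets and two lowercase hex digits.
    """
    out = []
    for byte in text.encode("utf-8"):
        if 0x20 <= byte < 0x7F:
            out.append(chr(byte))
        else:
            out.append("^^%02x" % byte)
    return "".join(out)


escaped = caret_escape("résumé.tex")  # hypothetical filename
print(escaped)
```

The output contains only ascii characters, so it displays the same way whatever codepage the terminal happens to be using.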
That should deal with decoding the command line correctly, I think. I'm
not sure whether the file-access APIs that xetex uses can actually use
unicode filenames, or whether it would also need to convert unicode
names (whether from the command line or from names used within
documents) back to the system codepage in order to actually open the file.
That may depend on whether it's using the posix APIs (which probably depend
on the system codepage) or windows-specific APIs that handle unicode
natively. Akira would know more about this, I'm sure.
(In theory, I suspect similar encoding issues apply on *nix platforms,
but the use of utf-8 as the default codepage is pretty widespread these
days, so most people won't run into these problems.)
Oh, and as for (pdf)tex: it doesn't run into these problems because it
treats the command line, like all input, simply as a string of bytes,
without regard to encoding. Whatever bytes it receives there will
presumably be passed unchanged to the file-access APIs. So things should
normally work, although it may be unable to access files whose (unicode)
name cannot be represented in the current system codepage.
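As an illustration of that last limitation, this Python sketch checks whether a name can be expressed in a given codepage at all. The filenames and the helper name representable are invented for the example.

```python
def representable(name: str, codepage: str) -> bool:
    """Return True if every character of name exists in the codepage."""
    try:
        name.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


# "é" exists in CP1252, so a byte-oriented program can name this file.
french_ok = representable("é.tex", "cp1252")

# "ě" does not exist in CP1252, so no byte string in that codepage
# corresponds to this name; it does exist in CP1250 (Central European).
czech_in_cp1252 = representable("Zdeněk.tex", "cp1252")
czech_in_cp1250 = representable("Zdeněk.tex", "cp1250")

print(french_ok, czech_in_cp1252, czech_in_cp1250)
```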
Moreover, if I understood Zdeněk's message correctly, it sounded like
there may sometimes be a mismatch between the codepage that the shell
(cmd.exe) is using (and hence the byte sequence passed to the *tex
process on the command line) and the codepage assumed by the APIs used
to access files within the binary. If that's the case, things will
indeed break.
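A sketch of how such a mismatch would corrupt a name, in Python. The two codepages are my own choice for illustration (CP850 for the shell, CP1252 for the file APIs); Zdeněk's message may have involved a different pair.

```python
# Assumed for illustration only: the shell and the file-access APIs
# disagree about the codepage.
shell_codepage = "cp850"
api_codepage = "cp1252"

typed_name = "é.tex"  # hypothetical filename typed at the prompt

# The shell turns the typed name into bytes using its own codepage...
bytes_passed = typed_name.encode(shell_codepage)

# ...and the file APIs interpret those same bytes in a different one.
name_seen_by_api = bytes_passed.decode(api_codepage)

print(bytes_passed, repr(name_seen_by_api))
```

Here the byte for "é" in the shell's codepage means a low quotation mark in the other one, so the program would go looking for a file that does not exist.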
JK