[tex-live] Problems with non-7bit characters in filename
zdenek.wagner at gmail.com
Fri Jul 4 11:53:49 CEST 2014
2014-07-04 11:02 GMT+02:00 Klaus Ethgen <Klaus+texlivelist at ethgen.ch>:
> On Fri, 4 Jul 2014 at 2:17, Reinhard Kotucha wrote:
>> On 2014-07-04 at 00:08:42 +0100, Klaus Ethgen wrote:
>> > On Thu, 3 Jul 2014 at 23:46, Zdenek Wagner wrote:
>> > > > I was pointed to this list to report the following Bug. Please put me in
>> > [Bug in filesystem code]
>> > > Lualatex is right, umlaut characters in latin1 are invalid sequences
>> > That's true. While latin1 can include every possible character, UTF-8
>> > cannot. (possible as in: possible to have on the wire)
>> You misunderstood. The opposite is true. UTF-8 (Unicode) supports
>> all characters, Latin1 is a simple 8-bit encoding which supports only
>> Western European languages (except French).
> Well, no, the other way around. In UTF-8, there are byte values that are
> invalid. In latin1 there is no invalid byte value.
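The disagreement here is easy to check mechanically; a minimal Python sketch of both claims (the byte value is my example, not from the thread):

```python
# Every byte value 0x00-0xFF maps to some character in Latin-1, so
# decoding arbitrary bytes as Latin-1 never fails.
raw = bytes([0xE4])               # "ä" in Latin-1
print(raw.decode("latin-1"))      # -> ä

# The same lone byte is an *invalid sequence* in UTF-8: 0xE4 announces
# a 3-byte sequence, so two continuation bytes must follow.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid UTF-8:", err.reason)
```

So Latin-1 accepts every byte but covers few characters, while UTF-8 covers all of Unicode but rejects some byte sequences; the two posts are talking about different properties.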
>> UTF-8 is the encoding of the future because it supports all languages
>> used today. This is the reason why XeTeX and LuaTeX exist at all.
> Well, no, not all programming languages support UTF-8 natively.
> I even think that programming languages and tools should be encoding
> agnostic and just use what they get.
No, they cannot in principle if they work with character strings. It
is possible if they work with bytes but then you have to do all the
work yourself, you cannot use wchar.
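The bytes-versus-characters distinction can be shown in a short Python sketch (the byte values are my illustration, not from the thread):

```python
data = bytes([0xE4, 0x62, 0x63])   # "äbc" in Latin-1; invalid as UTF-8

# Working on raw bytes needs no decoding, but character operations
# (case mapping, collation, wchar-style indexing) are unavailable:
print(len(data))                    # 3 bytes, not necessarily 3 characters

# To work with characters, the encoding must be decided first:
text = data.decode("latin-1")
print(text.upper())                 # -> ÄBC
```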
>> When I took over maintenance of VnTeX my OS still used Latin1. It was
>> a pain! I then switched to UTF-8 and everything worked fine.
> That's your experience; mine is the complete opposite.
>> IMO all these national ISO-2022/ISO-8859 encodings are archaic. The
>> future is UTF-8.
> That's just a religious question. I don't think they are archaic. They
> exist side by side with all the Unicode encodings.
You have just one Unicode character set but several Unicode transformation formats.
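That is, one set of code points with several byte-level representations; a quick Python illustration:

```python
s = "ž"  # a single code point, U+017E

# The same character in three Unicode transformation formats:
print(s.encode("utf-8"))      # b'\xc5\xbe'          (2 bytes)
print(s.encode("utf-16-be"))  # b'\x01\x7e'          (2 bytes)
print(s.encode("utf-32-be"))  # b'\x00\x00\x01\x7e'  (4 bytes)
```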
> But unfortunately, the discussion about encodings is often just going
> onto a holy discussion.
>> > > in utf-8 but both luatex and xetex work internally in unicode. I
>> > > am not sure whether it is possible to change interaction with
>> > > file system encoding easily.
>> > Why converting the filename at all? The file name is the same on
>> > command line and on the file system. So without any reencoding
>> > everything would be fine.
>> It's not always the case. A German Windows is using CP1252 on the
>> command line and UTF-16 internally for file names. It's a pain.
> That could be. I do not care about windows. But I believe that a TeX
> distribution must care about such cases.
You are free to submit a patch; the API calls for determining the file
system encoding are documented both for Unix systems and for Windows.
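For illustration, a sketch of how a script can query those settings portably (Python here as a stand-in for C-level APIs such as nl_langinfo(CODESET)):

```python
import sys
import locale

# Encoding Python uses for file names on this system:
print(sys.getfilesystemencoding())   # e.g. 'utf-8' on modern Linux/macOS

# The locale's preferred text encoding (from LANG/LC_CTYPE on Unix):
print(locale.getpreferredencoding())
```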
> However, even in that case there is a clear path from one encoding into
> another. Even if not all characters can be displayed in the other
> charset, TeX has to respect the console charset too. Why isn't that the
> same on Unix/Linux?
>> Yes, AFAIK OpenOffice gracefully supports UTF-8. You should have
>> configured your file system to use UTF-8.
> I do not get the connection between openoffice and why I >>should have<<
> the file system in UTF-8. I do not use office suites.
> To the point about using UTF-8 on the filesystem: that would end in
> filenames that cannot be managed anymore. No mv, no rm; simply nothing
> would be possible with them if they have byte values in the filename
> that are invalid in UTF-8. Such files would stay in the filesystem
> forever. Even if the filesystem is clean in the beginning, such
> erroneous byte values could appear after a crash or an OS-level error.
No, that's not true. I have UTF-8 file names in Czech with accented
characters, in Urdu (Arabic script) and in Hindi (Devanagari script),
and all system commands such as cp, mv etc. work without problems. I
can even copy such files to Dropbox, share them with Windows users, and
they see the file names correctly.
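Modern tools also survive the occasional invalid name rather than losing the file forever; a minimal Python sketch of the surrogateescape round-trip used on POSIX systems (the file name is my example):

```python
import os

raw = b"report-\xe4.tex"   # contains a byte that is not valid UTF-8

# Instead of refusing to touch such a name, Python maps the bad byte to
# a lone surrogate code point and maps it back unchanged when encoding:
name = os.fsdecode(raw)          # on a UTF-8 POSIX system: 'report-\udce4.tex'
assert os.fsencode(name) == raw  # the original bytes round-trip intact
print(repr(name))
```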
> Even more, it will serve me filenames I never want to have on the
> filesystem. (Like the Latin a replaced by a Russian one. And this is
> just an annoyance; there are more severe code points in UTF-8.)
> Unpacking archives from other people is just too dangerous when using
> UTF-8 on the filesystem. The worst that could happen in latin1 is some
> crappy filename that has to be renamed. But no harm could happen there.
>> This is the default for years (on Linux and OS/X, at least).
> Don't use defaults without questioning and understanding them. I don't,
> I have good reasons why I use latin1. I have no reason to use UTF-8.
>> Windows always lags 20..30 years behind and still insists on CP1252
>> (CP850 on the command line) for German and similar idiocies for other
>> languages.
> I do not want to talk about windows. That is a toy OS in my eyes. But
> as far as I know, windows was using Unicode long before Unix even
> thought about it. And wasn't the CP850 stuff just used for DOS?
>> > I never had those problems with latin1 (except with only a few
>> > programs like luatex). But I had many problems in the past when
>> > trying to use UTF-8. However, that personal stuff is good to know
>> > but does not help in this situation.
>> But now you have problems with Latin1. The reason is that you still
>> insist on archaic encodings like Latin1 while the rest of the world is
>> striving towards Unicode.
> Well, just because there is a bug in one program. Just because one
> piece of software is wrongly programmed.
> I do not want to join the UTF-8 religion. So please don't try to
> proselytize me!
>> > Fact is that even software that uses UTF-8 (or another Unicode
>> > format) internally works well in my environment. (Examples:
>> > Libreoffice, Gimp, Geeqie, ... (Geeqie, I am one of the people
>> > working on it).) So it must be possible in lualatex or xetex too.
>> I don't know what you want to achieve. You said:
> Well, just work with any charset!? Most software does. Only few are
> By the way, pdflatex works well too with every charset. At the moment I
> have no real reason to use lualatex. I just wanted to play a bit with
> that and found this bug. I do not want to do religious discussions, my
> intention is to report the bug and hopefully get it fixed.
pdflatex knows nothing about charsets. You have two scenarios. In the
less frequent case encTeX (by Petr Olšák) converts the input
characters directly to the font encoding, while the more frequent case
makes use of the inputenc package, which converts the input via active
characters to LICR and then, via the fontenc package, to the font
encoding. The latter case has potential problems. For instance,
\futurelet\somechar\dosomething will not work properly with inputenc
but works well with other engines.
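A minimal sketch of that pitfall (my illustration, not from the original mail): with inputenc, an "ä" in the source is an active character whose expansion to the LICR command \"a happens only later, so a \futurelet that peeks at the next token captures the active byte rather than a finished character; XeTeX and LuaTeX hand TeX a single character token instead.

```latex
% pdfLaTeX: "ä" arrives as active byte(s) that expand to \"a via
% inputenc/fontenc.  \futurelet therefore captures an active character,
% not the accented letter itself, so \ifx comparisons against a real
% character token can misfire.  Under XeTeX/LuaTeX the same code sees
% one ordinary character token and behaves as expected.
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\begin{document}
\futurelet\somechar\relax ä
\end{document}
```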
> Just the font handling is a bit better in lualatex. However, for my
> work, I can stay with the way LaTeX2e is doing it.
>> > While latin1 can include every possible character, UTF-8 cannot.
>> This is definitely wrong. The opposite is true.
> No, it is true, independent of what you think. UTF-8 cannot represent
> every byte value. Some are just invalid, but they happen here and there.
UTF-8 is a Unicode transformation format; its goal is not to transport
50% zeros for texts containing US-ASCII only. For fast decoding it is
designed as a prefix code, and its aim is to transport characters, not
arbitrary byte values.
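Both design goals are easy to observe in a small Python sketch:

```python
text = "Hello"  # pure US-ASCII

# UTF-16 spends half of its bytes on zeros for ASCII-only text:
print(text.encode("utf-16-le"))   # b'H\x00e\x00l\x00l\x00o\x00'

# UTF-8 keeps ASCII unchanged, one byte per character:
print(text.encode("utf-8"))       # b'Hello'

# Prefix-code property: the lead byte alone announces the sequence
# length, so a decoder can resynchronize at any character boundary.
b = "ž".encode("utf-8")           # b'\xc5\xbe'
assert (b[0] & 0b1110_0000) == 0b1100_0000   # lead byte of a 2-byte sequence
assert (b[1] & 0b1100_0000) == 0b1000_0000   # continuation byte
```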
Having an encoding-agnostic program that treats file names as octets
instead of characters is really dangerous. Imagine that a program
decides to write a file whose name is a single legal character in some
encoding but whose byte value is * in latin1. If the name goes
through shell expansion, this will delete all files from your current
directory. That's why good programs are encoding aware.
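A classic concrete instance of this trap (my illustration, not from the thread) is Shift_JIS, where the trailing byte of an ordinary character collides with an ASCII metacharacter:

```python
# "表" (U+8868) encodes in Shift_JIS as 0x95 0x5C; the trailing byte is
# the ASCII backslash.  An encoding-agnostic tool that hands these raw
# octets to a shell, glob, or escaping routine sees a metacharacter
# where the user typed a plain letter -- the same trap as a stray '*'
# (0x2A) reaching wildcard expansion.
b = "表".encode("shift_jis")
print(b)                    # b'\x95\\'
assert b"\x5c" in b         # a backslash byte hides inside the character
```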
> --
> Klaus Ethgen http://www.ethgen.ch/
> pub 4096R/4E20AF1C 2011-05-16 Klaus Ethgen <Klaus at Ethgen.de>
> Fingerprint: 85D4 CA42 952C 949B 1753 62B3 79D0 B06F 4E20 AF1C