[tex-live] Problems with non-7bit characters in filename

Klaus Ethgen Klaus+texlivelist at ethgen.ch
Fri Jul 4 11:02:38 CEST 2014



On Fri, 4 Jul 2014 at 2:17, Reinhard Kotucha wrote:
> On 2014-07-04 at 00:08:42 +0100, Klaus Ethgen wrote:
> 
>  > On Thu, 3 Jul 2014 at 23:46, Zdenek Wagner wrote:
>  > > > I was pointed to this list to report the following Bug. Please put me in
>  > [Bug in filesystem code]
>  > > Lualatex is right, umlaut characters in latin1 are invalid sequences
>  > 
>  > Thats true. While latin1 can include every possible character, UTF-8
>  > cannot. (possible as possible to have on the wire)
> 
> You misunderstood.  The opposite is true.  UTF-8 (Unicode) supports
> all characters, Latin1 is a simple 8-bit encoding which supports only
> Western European languages (except French).

Well, no, it is the other way around. In UTF-8 there are byte values
that are invalid. In latin1 there is no invalid byte value.
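The point about invalid byte values can be checked with a short Python
sketch. It is only an illustration: the helper name is_valid_utf8 and
the sample byte are assumptions made for this example and have nothing
to do with how LuaTeX handles its input.

```python
# Latin-1 maps each of the 256 possible byte values to exactly one
# character, so decoding arbitrary bytes as Latin-1 can never fail.
all_bytes = bytes(range(256))
latin1_text = all_bytes.decode("latin-1")
assert len(latin1_text) == 256


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample: 0xFC is the letter "ü" in Latin-1.
raw = b"\xfc"
print(raw.decode("latin-1"))               # decodes without error
print(is_valid_utf8(raw))                  # a lone 0xFC is not valid UTF-8
print(is_valid_utf8("ü".encode("utf-8")))  # the two-byte form is valid
```

So a file name containing the single byte 0xFC is a perfectly good
Latin-1 string but is rejected by a strict UTF-8 decoder.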

> UTF-8 is the encoding of the future because it supports all languages
> used today.  This is the reason why XeTeX and LuaTeX exist at all.

Well, no, not all programming languages support UTF-8 natively.

I even think that programming languages and tools should be
encoding-agnostic and just use what they get.
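One way a program that works in Unicode internally can still "just use
what it gets" is to carry undecodable bytes through unchanged. The
Python sketch below uses the standard surrogateescape error handler for
that. The function names and the sample file name are hypothetical, and
nothing here claims that LuaTeX or XeTeX work this way.

```python
def to_internal(raw_name: bytes) -> str:
    """Decode a file name for internal use without losing any byte.

    Bytes that are not valid UTF-8 are mapped to lone surrogate code
    points (U+DC80..U+DCFF) instead of raising an error.
    """
    return raw_name.decode("utf-8", errors="surrogateescape")


def to_filesystem(name: str) -> bytes:
    """Turn the internal string back into the original bytes."""
    return name.encode("utf-8", errors="surrogateescape")


# Hypothetical file name as it would arrive from a Latin-1 command line.
latin1_name = "Übung.tex".encode("latin-1")
internal = to_internal(latin1_name)

# The round trip returns exactly the bytes that were passed in, so the
# file can be opened under the name the user typed.
assert to_filesystem(internal) == latin1_name

# A name that already is valid UTF-8 passes through unchanged as well.
utf8_name = "Übung.tex".encode("utf-8")
assert to_filesystem(to_internal(utf8_name)) == utf8_name
```

The program never needs to know which encoding the name was written
in; it only has to hand the same bytes back to the operating system.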

> When I took over maintenance of VnTeX my OS still used Latin1.  It was
> a pain!  I then switched to UTF-8 and everything worked fine.

That's your experience; mine is the complete opposite.

> IMO all these national ISO-2022/ISO-8859 encodings are archaic.  The
> future is UTF-8.

That's just a religious question. I don't think they are archaic. They
stand side by side with all the Unicode encodings.

But unfortunately, a discussion about encodings often just turns into
a holy war.

>  > > in utf-8 but both luatex and xetex work internally in unicode. I
>  > > am not sure whether it is possible to change interaction with
>  > > file system encoding easily.
>  > 
>  > Why converting the filename at all? The file name is the same on
>  > command line and on the file system. So without any reencoding
>  > everything would be fine.
> 
> It's not always the case.  A German Windows is using CP1252 on the
> command line and UTF-16 internally for file names.  It's a pain.

That could be. I do not care about Windows. But I believe that a TeX
distribution must care about such cases.

However, even in that case there is a clear path from one encoding to
another. Even if not all characters can be displayed in the other
charset, TeX has to respect the console charset too. Why isn't that the
same on Unix/Linux?
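Such a path from one encoding to another can be sketched in a few lines
of Python. The transcode helper and the sample file names are
assumptions made for this illustration only; characters that do not
exist in the target charset are replaced by "?".

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode data from the source charset into the target charset.

    Characters that the target charset cannot represent are replaced
    by "?" instead of aborting the conversion.
    """
    return data.decode(source).encode(target, errors="replace")


# Hypothetical name typed on a CP1252 command line, converted to the
# UTF-16 form used for file names.
cmdline_name = "Grüße.tex".encode("cp1252")
print(transcode(cmdline_name, "cp1252", "utf-16-le"))

# The euro sign exists in CP1252 but not in Latin-1, so it is replaced.
euro_name = "€.tex".encode("cp1252")
print(transcode(euro_name, "cp1252", "latin-1"))  # b'?.tex'
```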

> Yes, AFAIK OpenOffice gratefully supports UTF-8.  You should have
> configured your file system to use UTF-8.

I do not get the connection between OpenOffice and why I »should have«
the file system in UTF-8. I do not use office suites.

On the point about using UTF-8 on the filesystem: that would end in
having filenames that cannot be managed anymore. No mv, no rm, simply
nothing would be possible with them if they have byte values in the
filename that are invalid in UTF-8. Such files would stay in the
filesystem forever. Even if the filesystem is clean in the beginning,
such erroneous byte values could appear through a crash or an OS-level
error.

Even more, it would serve me filenames that I never want to have on my
filesystem. (Like the Latin a replaced by a Russian one. And this is
just an annoyance, there are more severe code points in UTF-8.)
Unpacking containers from other people is just too dangerous when using
UTF-8 on the filesystem. The worst that could happen in latin1 is some
crappy filename that has to be renamed. But no harm could happen there.
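The look-alike problem can be made visible with Python's standard
unicodedata module. This is only a sketch: the scripts_in helper and
the sample file name are hypothetical, and taking the first word of the
Unicode character name is a rough stand-in for a real script lookup.

```python
import unicodedata

latin_a = "a"            # U+0061
cyrillic_a = "\u0430"    # U+0430, usually rendered identically

print(unicodedata.name(latin_a))     # LATIN SMALL LETTER A
print(unicodedata.name(cyrillic_a))  # CYRILLIC SMALL LETTER A

# The two characters are different code points and different bytes.
assert latin_a != cyrillic_a
assert cyrillic_a.encode("utf-8") == b"\xd0\xb0"


def scripts_in(name: str) -> set:
    """Return the first word of the Unicode name of every letter."""
    return {unicodedata.name(ch).split()[0] for ch in name if ch.isalpha()}


# Hypothetical file name in which the "a" is the Cyrillic one.
suspicious = "p\u0430per.tex"
print(scripts_in(suspicious))

# Latin-1 simply has no way to store such a name.
try:
    suspicious.encode("latin-1")
except UnicodeEncodeError:
    print("not representable in latin1")
```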

> This is the default for years (on Linux and OS/X, at least).

Don't use defaults without questioning and understanding them. I don't,
I have good reasons why I use latin1. I have no reason to use UTF-8.

> Windows always lags 20..30 years behind and still insists on CP1252
> (CP850 on the command line) for German and similar idiocies for other
> languages.

I do not want to talk about Windows. That is a toy OS in my eyes. But
as far as I know, Windows was using Unicode long before Unix even
thought about it. And wasn't the CP850 stuff just used for DOS?

>  > I never had that problems with latin1 (except with only few software
>  > like luatex). But I had many problems in past with trying to use UTF-8.
>  > However, that personal stuff is good to know but does not help in this
>  > situation.
> 
> But now you have problems with Latin1.  The reason is that you still
> insist on archaic encodings like Latin1 while the rest of the world is
> striving towards Unicode.

Well, only because there is a bug in one program. Only because one
piece of software is wrongly programmed.

I do not want to join the UTF-8 religion. So please don't try to
proselytize me!

>  > Fact is that even software that uses UTF-8 (or other unicode) internal,
>  > work well in my environment. (Examples: Libreoffice, Gimp, Geeqie, ...
>  > (Geeqie, I am one of the people working on it)) So it must be possible
>  > to do that in lualatex or xetex too.
> 
> I don't know what you want to achieve.  You said:

Well, just work with any charset!? Most software does. Only a few
programs are buggy.

By the way, pdflatex works well with every charset too. At the moment I
have no real reason to use lualatex. I just wanted to play with it a
bit and found this bug. I do not want to have religious discussions; my
intention is to report the bug and hopefully get it fixed.

Only the font handling is a bit better in lualatex. However, for my
work, I can stay with the way LaTeX2e does it.

>  > While latin1 can include every possible character, UTF-8 cannot.
> 
> This is definitely wrong.  The opposite is true.

No, it is true, regardless of what you think. UTF-8 cannot represent
every byte value. Some are simply invalid but occur here and there.

Regards
   Klaus
- -- 
Klaus Ethgen                              http://www.ethgen.ch/
pub  4096R/4E20AF1C 2011-05-16   Klaus Ethgen <Klaus at Ethgen.de>
Fingerprint: 85D4 CA42 952C 949B 1753  62B3 79D0 B06F 4E20 AF1C



