[tex-live] Problems with non-7bit characters in filename

Robin Fairbairns Robin.Fairbairns at cl.cam.ac.uk
Fri Jul 4 12:00:27 CEST 2014


Klaus Ethgen <Klaus+texlivelist at ethgen.ch> wrote:

> On Fri, 4 Jul 2014 at 2:17, Reinhard Kotucha wrote:
> > On 2014-07-04 at 00:08:42 +0100, Klaus Ethgen wrote:
> > 
> >  > On Thu, 3 Jul 2014 at 23:46, Zdenek Wagner wrote:
> >  > > > I was pointed to this list to report the following Bug. Please put me in
> >  > [Bug in filesystem code]
> >  > > Lualatex is right, umlaut characters in latin1 are invalid sequences
> >  > 
> >  > That's true. While latin1 can carry every possible byte value, UTF-8
> >  > cannot. (possible in the sense of possible to have on the wire)
> > 
> > You misunderstood.  The opposite is true.  UTF-8 (Unicode) supports
> > all characters, Latin1 is a simple 8-bit encoding which supports only
> > Western European languages (except French).
> 
> Well, no, the other way around. In UTF-8, there are byte values that are
> invalid. In latin1 there is no invalid byte value.

you're confusing byte (octet in standards-speak) with character.  an
octet is 8 bits ... full stop.

a character is an atom of some sort of language.  how you represent it
is not made clear by its name.
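
to make that concrete, here's a small python sketch (illustrative
only, not anything from the thread): every single byte decodes under
latin-1, but not every byte sequence is valid utf-8, even though the
character u-umlaut exists in both encodings.

    # every one of the 256 byte values maps to a character in latin-1,
    # so decoding arbitrary bytes as latin-1 never fails
    all_bytes = bytes(range(256))
    all_bytes.decode('latin-1')        # never raises

    # utf-8 has structural rules, so some byte sequences are invalid
    try:
        b'\xfc'.decode('utf-8')        # 0xfc is latin-1 u-umlaut, but
    except UnicodeDecodeError as err:  # not a valid utf-8 sequence
        print(err)

    # the character exists in both encodings; its bytes differ
    print('ü'.encode('latin-1'))       # b'\xfc'      (one byte)
    print('ü'.encode('utf-8'))         # b'\xc3\xbc'  (two bytes)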

> > UTF-8 is the encoding of the future because it supports all languages
> > used today.  This is the reason why XeTeX and LuaTeX exist at all.
> 
> Well, no, not all programming languages support UTF-8 natively.

i assume that you meant "natural languages".  programming languages are
artificial, and may declare in their spec what characters and encodings
they accept.

> I even think that programming languages and tools should be
> encoding-agnostic and just use what they get.

a counsel of perfection, but it would be nice.  note that fully-
flexible encoding support requires a completely explicit encoding
specification.  for 8-bit characters, that's provided by iso 2022.  i
have tried working with it (writing other standards that require its
use) and it's a nightmare.
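
for a flavour of why iso 2022 is painful: it switches character sets
in-band with escape sequences, so the byte stream is stateful and a
consumer must track that state to interpret any given byte.  a hedged
python illustration, using the standard iso2022_jp codec:

    # the escape sequences shift between ascii and jis x 0208
    # mid-stream; nothing about a single byte tells you which set
    # it belongs to
    text = 'TeX 日本語'
    print(text.encode('iso2022_jp'))
    # roughly: b'TeX \x1b$BF|K\\8l\x1b(B'
    #               ^^^^^^ shift to jis   ^^^^^^ shift back to ascii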

> > When I took over maintenance of VnTeX my OS still used Latin1.  It was
> > a pain!  I then switched to UTF-8 and everything worked fine.
> 
> That's your experience; mine is the complete opposite.
> 
> > IMO all these national ISO-2022/ISO-8859 encodings are archaic.  The
> > future is UTF-8.
> 
> That's just a religious question. I don't think they are archaic. They
> stand side by side with all the Unicode encodings.

in most practical applications i've seen, the encoding is assumed.
(hence the hiccups when you feed iso 8859-1 source to (lua|xe)tex.)

> But unfortunately, the discussion about encodings often just turns
> into a religious one.

no.  there's a practical core to the discussion.  you and i get on ok
with iso 8859-1 because (to first order) our countries wrote the iso
8859 standards to "just work" with our languages.

> >  > > in utf-8 but both luatex and xetex work internally in unicode. I
> >  > > am not sure whether it is possible to change the interaction
> >  > > with the file system encoding easily.
> >  > 
> >  > Why convert the filename at all? The file name is the same on the
> >  > command line and on the file system, so without any re-encoding
> >  > everything would be fine.
> > 
> > It's not always the case.  A German Windows is using CP1252 on the
> > command line and UTF-16 internally for file names.  It's a pain.

because m$ confused things, from the start, by defining their own
character coding standards.
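
a concrete illustration of the mismatch reinhard describes (python,
purely illustrative): the same name has three different byte forms
depending on which layer of the system you ask.

    name = 'Grüße.tex'
    print(name.encode('cp1252'))     # b'Gr\xfc\xdfe.tex'  (german console)
    print(name.encode('utf-16-le'))  # b'G\x00r\x00\xfc\x00\xdf\x00e\x00...'
                                     # (ntfs-style, two bytes per character)
    print(name.encode('utf-8'))      # b'Gr\xc3\xbc\xc3\x9fe.tex'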

> That could be. I do not care about windows. But I believe that a TeX
> distribution must care about such cases.

even m$'s code pages have mostly usable overlap with the iso latin codes.

> However, even in that case there is a clear path from one encoding into
> another. Even if not all characters can be displayed in the other
> charset, TeX has to respect the console charset too. Why isn't that the
> same on Unix/Linux?

because we're still in transition.  i saw the start of this thread, and
thought "i would have forgotten that one, too".  this is plainly a
problem that needs solving (and will eventually be solved), but imo it's
not one to beat people over the head about.
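
the "clear path" quoted above does exist in principle: decode with the
console's charset, then re-encode with the filesystem's.  a minimal
python sketch of that step (console_to_fs is my hypothetical helper,
not anything tex live provides):

    def console_to_fs(raw: bytes, console_enc: str, fs_enc: str) -> bytes:
        """re-express a raw console byte string in the filesystem's
        encoding; a character the target charset cannot hold raises
        UnicodeEncodeError here rather than silently corrupting"""
        return raw.decode(console_enc).encode(fs_enc)

    # latin-1 bytes for 'größe.tex' become valid utf-8 bytes
    print(console_to_fs(b'gr\xf6\xdfe.tex', 'latin-1', 'utf-8'))
    # b'gr\xc3\xb6\xc3\x9fe.tex'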

> > Yes, AFAIK OpenOffice gracefully supports UTF-8.  You should have
> > configured your file system to use UTF-8.
> 
> I do not get the connection between openoffice and why I »should have«
> the file system in UTF-8. I do not use office suites.

i don't know why oo appeared here, either.

> To the point about using UTF-8 on the filesystem. That would end in
> filenames that cannot be managed anymore. No mv, no rm, simply nothing
> would be possible with them if they have byte values in the filename
> that are invalid in UTF-8. Such files will stay in the filesystem
> forever. Even if the filesystem is clean in the beginning, such
> erroneous byte values could appear after a crash or an OS-level error.

huh?  when you set up your system, you declare the encoding that
applications should expect.  if i say my encoding is iso 8859-1, i
*must* accept that i can't open files whose names use non-latin 1
characters.  that's it.  if i use iso 8859-7 (greek) i should not expect
to be able to use files whose names are using iso 8859-5 (cyrillic) --
at the least because there are characters which are not shared between
those languages (neither Ω in greek nor Щ in russian appears in the
'other' language).
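
as an aside on the "no mv, no rm" worry quoted above: posix kernels
treat filenames as opaque bytes, so byte-oriented tools keep working
even when the name is invalid utf-8.  a hedged python sketch (posix
assumed, names invented for illustration):

    import os

    bad = b'caf\xe9.tex'            # 0xe9 is latin-1 e-acute; followed
    open(bad, 'wb').close()         # by '.', it is invalid utf-8

    print(os.listdir(b'.'))         # ask with bytes, get bytes back

    os.rename(bad, b'cafe.tex')     # the moral equivalent of mv
    os.remove(b'cafe.tex')          # ... and of rm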

> Even more, it will serve me with filenames I never want to have on the
> filesystem. (Like the latin a replaced by a Russian one. And this is
> just an annoyance; there are more severe code points in UTF-8.)
> Unpacking archives from other people is just too dangerous if using
> UTF-8 on the filesystem. The worst that could happen in latin1 is some
> crappy filename that has to be renamed. But no harm could happen there.
> 
> > This is the default for years (on Linux and OS/X, at least).
> 
> Don't use defaults without questioning and understanding them. I don't,
> I have good reasons why I use latin1. I have no reason to use UTF-8.

fair enough, but don't argue from your specific case to other people's
generic cases.  it just doesn't work.
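
on the homoglyph worry quoted a little further up (a latin 'a' silently
replaced by a cyrillic one): mixed-script names are at least cheap to
detect.  a rough heuristic sketch, mine rather than anything a tex
distribution ships:

    import unicodedata

    def scripts(name: str) -> set:
        """guess the scripts used by the letters of a filename from
        their unicode character names; a heuristic, not a real
        confusables check"""
        return {unicodedata.name(ch, '?').split(' ')[0]
                for ch in name if ch.isalpha()}

    print(scripts('paper.tex'))     # {'LATIN'}
    print(scripts('pаper.tex'))     # {'CYRILLIC', 'LATIN'} -- that
                                    # 'а' is u+0430, not u+0061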

> > Windows always lags 20..30 years behind and still insists on CP1252
> > (CP850 on the command line) for German and similar idiocies for other
> > languages.
> 
> I do not want to talk about windows. That is a toy OS in my eyes. But
> as far as I know, windows was using Unicode long before Unix even
> thought about it. And wasn't the CP850 stuff just used for DOS?

despite your exaggeration, it's actually up with the best practice, if
that's what you want, and it will provide you with backward
compatibility to the archaic code pages if you need that.  windows may
be big and sluggish, and a pita to use, but it's not "out of it" as you
suggest.

once upon a time, all my programming was done in 5-bit baudot code (my
first programs were written in the 60s).  7-bit ascii was a great
advance.  8-bit latin-* standards improved things again, but nowadays
systems programmed in one place in the world are increasingly used in
another, and we need to make them work there.  using unicode is a great
support to such advances.

(and i retire in 2 months' time, and haven't written commercial code
since the 1990s.)

robin
the logorrheic.



