Non-ASCII characters in filenames/Unicode

Norbert Preining norbert at preining.info
Wed Mar 9 03:24:55 CET 2022


Hi John,

On Mon, 07 Mar 2022, John Collins wrote:
> I am updating latexmk so that it more correctly handles non-ASCII characters
> in filenames.  Particular problems are caused by the difference between the
> UTF-8 encoding used in .log, .fls, and .aux files etc, and the typically
> non-UTF-8 system code page used in Windows.  I have two questions for the TL

Huge pain.

> 1. Where do I look for how these things are done in the TL programs,
> especially any that are written in Perl?  Who should I ask about this?

Here is fine. I have no *real* recollection of the details, only that
we have fought with the same problem many times.

The places where one has to take care are output to the console
(STDOUT and STDERR), encoding of file names, and encoding of file
content.

Since neither install-tl nor tlmgr creates or handles files with
non-ASCII file names, we happily ignore that ;-)

For the output to console we use the following in install-tl:
```
use Encode::Alias;
eval {
  require Encode::Locale;
  Encode::Locale->import ();
  debug("Encode::Locale is loaded.\n");
};
if ($@) {
  if (win32) {
    die ("For Windows, Encode::Locale is required.\n");
  }

  debug("Encode::Locale is not found. Assuming all encodings are UTF-8.\n");
  Encode::Alias::define_alias('locale' => 'UTF-8');
  Encode::Alias::define_alias('locale_fs' => 'UTF-8');
  Encode::Alias::define_alias('console_in' => 'UTF-8');
  Encode::Alias::define_alias('console_out' => 'UTF-8');
}
binmode (STDIN, ':encoding(console_in)');
binmode (STDOUT, ':encoding(console_out)');
binmode (STDERR, ':encoding(console_out)');
```

As far as I remember, the reason for that was the former GUI, which was
written in Perl/Tk and had translated strings that were also output to
the console/terminal. With the above, the encoding also worked for
consoles that are not UTF-8 based (like Windows).

But I might be wrong about this, too ...
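
The same aliases can in principle be used for file names rather than
console output. This is only a minimal sketch, not something install-tl
does: it assumes the `locale_fs` alias provided by Encode::Locale, and
the helper names `fs_bytes` and `fs_chars` are made up for illustration.

```perl
use strict;
use warnings;
use Encode ();
use Encode::Alias ();

# Load Encode::Locale if available; otherwise assume UTF-8,
# mirroring the fallback used for the console aliases.
eval { require Encode::Locale; Encode::Locale->import(); 1 }
  or Encode::Alias::define_alias('locale_fs' => 'UTF-8');

# Hypothetical helper: Perl character string -> bytes for the file system.
sub fs_bytes {
  my ($name, $enc) = @_;
  $enc = 'locale_fs' unless defined $enc;
  return Encode::encode($enc, $name);
}

# Hypothetical helper: bytes from the file system -> Perl character string.
sub fs_chars {
  my ($bytes, $enc) = @_;
  $enc = 'locale_fs' unless defined $enc;
  return Encode::decode($enc, $bytes);
}
```

The optional second argument is there so that a specific code page can
be forced, e.g. `fs_bytes($name, 'cp1252')`. Whether the bytes produced
this way are what a given Windows Perl build expects in `open` would
need to be checked.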

For the rest, as Karl said, we try hard to keep everything we output
in ASCII ;-) And all the files we create have ASCII file names, too.



If you want to deal with non-ASCII file names, I suggest looking at
https://github.com/texjporg/cjk-gs-support/blob/master/cjk-gs-integrate.pl
where dealing with non-ASCII file names is important (CJK file names).

We use
```
use utf8;
use Encode;

# UTF-8 bytes -> cp932 bytes (the Japanese Windows code page)
sub encode_utftocp {
  my ($foo) = @_;
  $foo = Encode::decode('utf-8', $foo);
  $foo = Encode::encode('cp932', $foo);
  return $foo;
}

# cp932 bytes -> UTF-8 bytes
sub encode_cptoutf {
  my ($foo) = @_;
  $foo = Encode::decode('cp932', $foo);
  $foo = Encode::encode('utf-8', $foo);
  return $foo;
}
```

and for example:
      $targetname = encode_utftocp($targetname);
      my $cmdl = "cmd.exe /c if not exist \"$targetname\" mklink ";

or
      # cp932 for win32 console
      if (win32()) {
        $fn = encode_utftocp($fn);
      }
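
The two helpers above hard-code cp932, which fits Japanese Windows. For
other system code pages one could pass the encodings as parameters. The
following is only a sketch; `recode_bytes` is a hypothetical name and
is not part of cjk-gs-integrate.pl.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical generalization of encode_utftocp/encode_cptoutf:
# convert a byte string from one encoding to another.
sub recode_bytes {
  my ($bytes, $from, $to) = @_;
  my $chars = Encode::decode($from, $bytes);   # bytes -> characters
  return Encode::encode($to, $chars);          # characters -> bytes
}

# Same conversion as encode_utftocp, with the code page spelled out.
# The input is the UTF-8 byte sequence for U+65E5 U+672C.
my $sjis = recode_bytes("\xE6\x97\xA5\xE6\x9C\xAC", 'UTF-8', 'cp932');
```

How to find the right code page name at run time is left open here;
Encode::Locale's `locale` alias may be one option.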

I hope that helps.

> 2. On a Windows system where the **system** code page is not 65001, the PWD
> line at the top of an fls file is encoded in the system code page.  The
> rest of the file is encoded in UTF-8.  Is this a bug?

Good point ... something Karl should probably answer. kpathsea should
use the same encoding for the whole file, I guess.
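
Until that is settled, a reader of the fls file could work around the
mixed encoding line by line. A minimal sketch, under the assumption
that the caller knows the system code page; `decode_fls_line` is a
hypothetical helper, not something from kpathsea or latexmk.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode one line of an .fls file.
# Try strict UTF-8 first; if the bytes are not valid UTF-8,
# fall back to the given code page.
sub decode_fls_line {
  my ($bytes, $fallback) = @_;
  my $chars = eval {
    Encode::decode('UTF-8', $bytes,
                   Encode::FB_CROAK() | Encode::LEAVE_SRC());
  };
  return defined $chars ? $chars : Encode::decode($fallback, $bytes);
}
```

This is a heuristic: a byte sequence in the system code page can
happen to be valid UTF-8 as well, in which case the guess is wrong.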

All the best

Norbert

--
PREINING Norbert                              https://www.preining.info
Fujitsu Research     +    IFMGA Guide     +    TU Wien    +    TeX Live
GPG: 0x860CDC13   fp: F7D8 A928 26E3 16A1 9FA0 ACF0 6CAC A448 860C DC13

