[tex-live] announce: 2nd version of encTeX -- UTF-8 support

Petr Olsak olsak@math.feld.cvut.cz
Mon, 20 Jan 2003 08:32:01 +0100 (CET)


Dear TeXlive and web2c implementors,

I have released the second version of my encTeX, an extension of TeX
for input re-encoding. This version is available at

   ftp://math.feld.cvut.cz/pub/olsak/enctex/

including my pseudo-English documentation at

   ftp://math.feld.cvut.cz/pub/olsak/enctex/encdoc-e.pdf

This version supports the full UTF-8 encoding of input files and
\write files in 8bit TeX, pdfTeX and eTeX.

I want to say a word of thanks to David Necas (Yeti), who made
encTeX's UTF-8 tables from the Unicode NamesList.

More information follows.

--------------------------------------------------------

The UTF-8 encoding keeps the standard ASCII characters unchanged and
encodes the accented letters of our alphabets in two bytes. The
standard 8bit TeX is not ready for UTF-8 input because it has to
handle such a single character as two tokens. This means you cannot set
the \catcode, \uccode, etc. of these characters and you cannot do
\futurelet on the next character in the normal sense. The second
version of my encTeX solves these problems.
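
As an illustration of the two-byte point only (this Python sketch is
not part of encTeX; the helper name utf8_bytes and the sample letters
are chosen just for the example), the following shows that an ASCII
letter stays one byte while an accented letter becomes two bytes, which
an 8bit TeX would read as two separate characters:

```python
# Illustration only: how UTF-8 encodes ASCII and accented letters.
# utf8_bytes is a hypothetical helper name, not an encTeX facility.

def utf8_bytes(ch):
    """Return the UTF-8 byte values of a single character as a list."""
    return list(ch.encode("utf-8"))

if __name__ == "__main__":
    samples = [
        "a",        # plain ASCII letter
        "\u00e1",   # LATIN SMALL LETTER A WITH ACUTE
        "\u010d",   # LATIN SMALL LETTER C WITH CARON
    ]
    for ch in samples:
        codes = utf8_bytes(ch)
        # An 8bit TeX sees each of these bytes as a separate character.
        print(repr(ch), len(codes), [hex(b) for b in codes])
```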

The encTeX package is a small extension of TeX. You can build it
from the source files of TeX by changing the "tex.ch" file in your
distribution. A patch to the "tex.ch" file for the web2c distribution
is included.

encTeX is fully backward compatible with the original TeX.
It adds eight new primitives with which you can set or read the
conversion tables used by the input processor of TeX or used during
output to the terminal, log and \write files. If you don't use these
primitives, the program behaves 100% the same as the standard TeX.

The first version of encTeX was released in 1997. That version
could only do byte-per-byte conversion using the xord and xchr
vectors. The second version was designed in December 2002 and released
in January 2003. It makes it possible to convert multi-byte
sequences to one byte or to an arbitrary control sequence. You can
implement up to 256 UTF-8 codes as one byte and an unlimited number of
other UTF-8 codes as control sequences. All internals in 8bit TeX
(macros etc.) work in the same way as if a "normal one-byte encoding"
of input files were used.
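
The difference between byte-per-byte conversion and multi-byte
conversion can be sketched in Python. This is only a model of the idea,
not encTeX's actual implementation or syntax: the table contents, the
internal codes 0xE1 and 0xE8 (the ISO 8859-2 positions, picked only for
the example), the remapped byte 0xB1, the control sequence name \euro
and the function names are all assumptions made up for illustration.

```python
# Model only; all tables and names here are hypothetical, not encTeX's own.

# Byte-per-byte idea: one 256-entry vector, similar in spirit to xord.
xord = bytearray(range(256))
xord[0xB1] = 0xE1  # hypothetical remapping of one external byte

def convert_bytewise(data):
    """Map every input byte through the 256-entry table."""
    return bytes(data).translate(bytes(xord))

# Multi-byte idea: a byte sequence maps to one byte or a control sequence.
MULTIBYTE = {
    b"\xc3\xa1": b"\xe1",         # U+00E1 -> one internal byte (assumed)
    b"\xc4\x8d": b"\xe8",         # U+010D -> one internal byte (assumed)
    b"\xe2\x82\xac": b"\\euro ",  # U+20AC -> control sequence (assumed name)
}
LONGEST = max(len(key) for key in MULTIBYTE)

def convert_multibyte(data):
    """Replace known multi-byte sequences; pass other bytes unchanged."""
    out = bytearray()
    i = 0
    while i < len(data):
        # Try the longest candidate that still fits in the remaining input.
        for n in range(min(LONGEST, len(data) - i), 0, -1):
            chunk = bytes(data[i:i + n])
            if chunk in MULTIBYTE:
                out += MULTIBYTE[chunk]
                i += n
                break
        else:
            out.append(data[i])  # no table entry starts here
            i += 1
    return bytes(out)

if __name__ == "__main__":
    line = "a\u00e1\u010d\u20ac".encode("utf-8")
    print(convert_multibyte(line))
```

In this model the rest of the program sees only single bytes or control
sequences after conversion, which is why code that expects a one-byte
encoding can keep working unchanged.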

I think that the UTF-8 encoding will become more and more common. In
that situation, there is no other way than to modify the input
processor of TeX; otherwise 8bit TeX will be dead in a short time.

---------------------------

I am ready to help you with integrating encTeX into the common
TeX distributions. The current state is that encTeX is initialized
every time if the patch to tex.ch was applied. A second line is added
to the banner of TeX, pdfTeX or eTeX, which informs you about the
encTeX extension.

It is possible to implement encTeX in such a way that it is initialized
only if the -enc command line option is used. This is not done in the
current version because encTeX is implemented only at the WEB
language level, independent of system-specific matters. The scanning of
command line options is a system-specific part, so it is not
implemented now. But I am ready to cooperate with TeX implementors
to introduce the -enc option and to solve the TCX tables conflict.

Please implement encTeX in your TeX distributions. I would like
encTeX to become a standard extension of TeX used in many TeX
distributions.

Thank you

Petr Olsak