[tex-live] packages with characters > 127

Thu Dec 31 13:32:38 CET 2009

On 31 Dec 2009, at 12:02, Robin Fairbairns wrote:

> 
>> I mean yes, in an ideal world all would be iso-2022 or UTF32 or whatever
>> universal encoding you select, but it isn't. And we have hundreds of
>> packages and files here, and billions out in the world (documents
>> of users) with legacy encoding. Not being able to read them for
>> the most advanced engine is a bit strange.
>> 
>> Be forgiving with everything you get, but restrictive and strict 
>> with what you give yourself.
> 
> how does one "forgive" a non-standard 8-bit character when you thought
> you were reading utf-8?  ignore it? (how does the user find why their
> characters didn't appear?)  produce an error? (users will complain)
> produce a warning? (tex is already verbose enough, so users will
> complain)
> 
> the problem is, that unless the processor is told what encoding the
> document uses (of the huge numbers available, standardised,
> microsoft-used or largely private), it can't in general determine what
> the semantic of any character is.

Yes, this is precisely the problem. The vast majority of the documents (by which I mean files that *tex might be expected to read, whether they are actual texts to be typeset or packages of macro code) do not come with any reliable indication of the encoding they are using. In the "traditional" 8-bit TeX world, this means that unless the user is careful to ensure the input encoding, hyphenation patterns, font encoding, etc., are all properly co-ordinated, the results may be more or less incorrect. (8-bit) TeX itself makes few encoding assumptions, simply dealing with 8-bit bytes, but encoding assumptions (starting with ASCII and moving on - in many directions - from there) are implicit throughout our fonts and macro packages.

In the 8-bit world, the consequences of getting the encoding wrong may be obvious (pages of "garbage"), or they may be minor; for example, picking the wrong Latin-# encoding might have no visible effect, if the final document doesn't happen to use any of the characters that would be affected. Or perhaps the input and font encodings match, so the right glyphs appear on the page, but the hyphenation patterns were prepared for a different encoding, and so the optimal set of breaks is not found; many users will let this pass unnoticed. In general, the software in this situation cannot tell that something is wrong; it blindly processes the bytes and sends them to the output, and if there are encoding mismatches (between input, packages, fonts, ...) it will silently produce incorrect output.
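
To make the "silent" part concrete, here is a small Python sketch (nothing to do with TeX itself; the helper name decode_as is made up for illustration). It reads the same two bytes under two different Latin-# assumptions. Neither decoding fails, one byte changes meaning and the other does not, which is exactly why a wrong guess can go unnoticed.

```python
# Two bytes above 127, with no indication of which 8-bit encoding was meant.
data = b"\xe6 \xe9"

def decode_as(raw, encodings):
    """Decode the same bytes under each assumed encoding (hypothetical helper)."""
    return {name: raw.decode(name) for name in encodings}

results = decode_as(data, ["iso8859_1", "iso8859_2"])

# Every 8-bit decoding "succeeds"; there is nothing for the software to detect.
for name, text in results.items():
    print(name, [hex(ord(ch)) for ch in text])

# 0xE6 is U+00E6 (ae ligature) in Latin-1 but U+0107 (c with acute) in Latin-2,
# while 0xE9 is U+00E9 (e with acute) in both.
```

A document that only ever used the second byte would look correct under either assumption.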

In the world of Unicode engines, the problem simply becomes more obvious, because an engine that is interpreting the input byte stream as UTF-8 can (and should) actually detect and report when the input is invalid.
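
As a rough illustration of that detection (a Python sketch only, not how any TeX engine actually implements it; first_invalid_utf8 is a name invented here), a strict UTF-8 decoder can point at the exact byte where the input stops being well-formed:

```python
def first_invalid_utf8(data):
    """Return the byte offset of the first ill-formed UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")  # strict decoding: raises on any invalid sequence
    except UnicodeDecodeError as err:
        return err.start      # offset of the offending byte
    return None

# Well-formed UTF-8 passes; a stray Latin-1 byte (0xE9) is caught and located.
print(first_invalid_utf8("caf\u00e9 au lait".encode("utf-8")))
print(first_invalid_utf8(b"caf\xe9 au lait"))
```

Pure ASCII input is valid UTF-8 as well, so legacy files that never use characters > 127 are unaffected by such a check.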

We could, of course, treat all input text as using an 8-bit encoding by default, but then we'd face the question of WHICH 8-bit encoding, and (as already noted) most documents don't tell us. The only reliable way forward - indeed, the only reliable way to do ANY text processing, really - requires authors to provide (machine-readable) encoding metadata for their files. For those that don't provide it, we have to make some kind of default assumption, and I believe defaulting to UTF-8 is the most useful option.
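
A minimal sketch of that policy, again in Python and purely illustrative: the "% encoding: NAME" comment below is a hypothetical form of metadata made up for this example (not an existing TeX convention), and decode_source is likewise an invented name. Declared encodings are honoured; everything else is assumed to be UTF-8, and invalid input is reported rather than guessed at.

```python
import re

# Hypothetical metadata line, e.g. "% encoding: latin-1", near the top of a file.
DECLARATION = re.compile(rb"^%[ \t]*encoding:[ \t]*([A-Za-z0-9_.\-]+)", re.MULTILINE)

def decode_source(data, default="utf-8"):
    """Decode using the declared encoding if present, else the default.

    Returns (text, encoding_used). Raises UnicodeDecodeError when the bytes
    are not valid in the chosen encoding.
    """
    match = DECLARATION.search(data[:1024])  # only look near the start
    encoding = match.group(1).decode("ascii") if match else default
    return data.decode(encoding), encoding

declared = b"% encoding: latin-1\ncaf\xe9\n"
text, used = decode_source(declared)
print(used, repr(text))

undeclared = b"caf\xe9\n"
try:
    decode_source(undeclared)
except UnicodeDecodeError as err:
    print("invalid UTF-8 at byte", err.start, "- no encoding was declared")
```

The point of the design is that the failure in the second case is visible and tells the author what to fix, instead of quietly producing the wrong characters.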

JK