
Tidbit from comp.text.tex...



Sometimes I wonder just why we have this list. Surprisingly often I
find that some information has been circulated in a TUG issue, in a
book, or in this case, in a newsgroup, without it ever getting
mentioned in this forum -- not even a pointer. The enclosed was posted
by Berthold K.P. Horn in comp.text.tex.

Opinions welcome,

     Melissa.

Enc.

From: bkph@ai.mit.edu (Berthold K.P. Horn)
Newsgroups: comp.text.tex
Subject: font encoding problems and TFM checksum
Date: 01 Jan 1998 21:48:52 -0500
Organization: MIT Artificial Intelligence Lab
Message-ID: <vux1zyr8uaz.fsf@rice-chex.ai.mit.edu>

=====================================================================
  A way to help diagnose and cure font encoding mismatch problems.
=====================================================================

When using fonts other than Computer Modern, it is very important that
three entities agree on text font encoding (character layout):

   (i)   The TFM file must be set up for the desired encoding;
   (ii)  The DVI processor must reencode fonts to that encoding;
   (iii) We must tell (La)TeX about that encoding.

If these three do not match, one typically ends up with missing ligatures,
missing accented and special characters, or bizarre substitutions (like
emdash for fi, or the registered trademark for endash).  In some cases
`missing character' messages may appear in the log file - but not on screen -
and hence typically are not seen.  Since most encodings have the
alphanumerics in the same place, the output may look superficially OK.
Encoding problems may then come to light only much later (sometimes only
after a book has been printed!).

In the past this was not much of a concern since MF based PK bitmapped fonts
do not have the flexibility of different encoding choices - they all use
fixed character layouts.  It is, however, a serious concern when using
scalable fonts, such as fonts in Adobe Type 1 format (a.k.a. PostScript fonts
or ATM fonts) and TrueType fonts.

   Endless questions and problem reports on comp.text.tex attest to this!

The trick is to find a way to pass information to the DVI processor about
what encoding the TFM file was set up for.  This information must somehow be
`hidden' in the DVI file itself.  The TFM checksum can be used for this.

Presently the checksum in TFM files serves a very limited purpose.  A
checksum mismatch provides only *one* bit of information, namely whether the
checksum in the DVI file matches that in `the font'.  It can be hard to
interpret the significance of a mismatch, or to decide whether it is safe to
ignore such warnings - as it often is.

With increasing use of fonts other than bitmapped PK fonts, and with
increasing exposure to `encoding problems,' there is an opportunity now to
use the checksum in a more useful way.  There is also a significant need,
since TFM files are `encoding sensitive,' and using the wrong TFM file can
lead to serious problems. This is something that occurs more and more
frequently when using fonts other than Computer Modern, or when processing
DVI files produced at another installation that may be set up differently.

The checksum in the TFM file is one of the few parts of a font's metric
information that is carried forward into the DVI file.  It provides a unique
opportunity to `hide' information about a font's encoding.  This can then be
checked by the DVI processor - whether it be a previewer or printer driver.
This way one can be sure that the encoding used by the DVI processor matches
that used when creating the TFM file.
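
For concreteness, here is a rough sketch of how a DVI processor gets at
that checksum; this sketch is not part of the proposal, the function
names are made up, and a real driver would fold this into its existing
fnt_def handling:

#include <stdio.h>

/* Read a 4 byte big-endian quantity from the DVI file. */
unsigned long read_quad(FILE *dvi) {
	unsigned long value = 0;
	int k;

	for (k = 0; k < 4; k++)
		value = (value << 8) | (unsigned char) getc(dvi);
	return value;
}

/* Call with the DVI file positioned just after the font number of a  */
/* fnt_def1 command (opcode 243).  The remaining parameters are:      */
/* checksum[4], scaled size[4], design size[4], directory length[1],  */
/* name length[1], and then the font name itself.                     */
unsigned long dvi_font_checksum(FILE *dvi) {
	unsigned long checksum = read_quad(dvi);	/* copied from the TFM file */

	(void) read_quad(dvi);		/* scaled size - not needed here */
	(void) read_quad(dvi);		/* design size - not needed here */
	/* ... a real driver would now read the directory and font name ... */
	return checksum;
}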

The checksum is only a word of 32 bits, so there are obvious limits on how
much information can be encoded in it.  We cannot, for example, hope to
include the full encoding vector, which might be 256 glyph names, each up to
32 characters long (or say 256 UNICODE numbers each in the range 0-65535).
Clearly we can only get in the *name* of the encoding vector (or part
of it anyway).

Used in the obvious way, one can get only 4 characters into the checksum
word.  Since the case of the name of the encoding vector is not very
important, and since most such names are alphanumeric, one could instead
consider packing in 6 characters of the name using base 36 coding (26
letters plus 10 digits).  In fact, since 40^6 = 4,096,000,000 is still
less than 2^32 = 4,294,967,296 (while 41^6 is not), one can squeeze in 4
more character codes by using base 40 coding instead.  The extra codes
are useful for -, &, _, and for marking a character in the name that was
none of the above.

While this is less than ideal, it is about the best one can do.
The only restrictions are that encoding vector names be unique in their
first 6 characters, consist of alphanumerics (plus -, &, and _), and be
agreed upon.

Note that with such a scheme, the DVI processor can provide much more
useful information in case of a mismatch.  A typical error message might be:

Font `tir' set up for `texnan..' encoding, but DVI processor set up for `8r'

   A lot more illuminating than `Checksum mismatch'!
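
A driver could produce a message of that sort along the following lines;
this is only a sketch, the surrounding names are made up, and it relies
on the decodefourty routine given at the end of this post:

#include <stdio.h>

int decodefourty(unsigned long checksum, char *vectorname);	/* see below */

/* Sketch of a driver-side check.  `dvi_checksum' is the checksum     */
/* recorded in the fnt_def command of the DVI file (that is, from the */
/* TFM file the document was typeset with); `driver_checksum' belongs */
/* to the TFM file / encoding the driver itself is set up to use.     */
void check_font_encoding(char *fontname, unsigned long dvi_checksum,
		unsigned long driver_checksum) {
	char dvi_enc[8], driver_enc[8];

	if (dvi_checksum == driver_checksum) return;	/* all is well */

	decodefourty(dvi_checksum, dvi_enc);
	decodefourty(driver_checksum, driver_enc);
	fprintf(stderr,
		"Font `%s' set up for `%s' encoding, but DVI processor set up for `%s'\n",
		fontname, dvi_enc, driver_enc);
}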

Of course, one might instead simply try to suppress the incidence of font
encoding problems by legislating that everyone use the *same* encoding for
everything.  But this isn't very practical given the very real shortcomings
of any given `standard' encoding.  And even in such an ideal world with a
legislated encoding scheme, one ends up with TFM files for more than one
encoding (for example, T1-encoded fonts are often built on top of
8r-encoded base fonts, so TFM files for both T1 and 8r must coexist).

Another approach to dealing with this issue is to decorate the TFM file
names with abbreviations of the encoding used to create them.  This helps,
but leads to longer TFM file names, and is less certain than having the
information wired into the TFM file itself, which is safe even from file
renaming adventures.

The checksum algorithm used by AFM2TFM and PStoPK has changed over time.
The `new' version is a hash code derived from the widths in the AFM file
and the encoding vector.  As a hash code it is not invertible to recover
encoding information.

/**************************************************************************/

Here are some typical names for commonly used encoding vector files:

   textext.vec		OT1/TeX text encoding
   tex256.vec		T1/Cork encoding
   texnansi.vec		LY1/TeX 'n ANSI encoding
   texmac.vec		LM1/Textures encoding
   8r.vec		8r/TeX Base 1 encoding
   ts1.vec		TS1/Text companion encoding
   standard.vec		Adobe Standard Encoding
   ansinew.vec		Windows ANSI encoding
   mac.vec		Macintosh standard roman encoding

Here is some sample code (critique of programming style not invited :-)

/**************************************************************************/

/* Takes the first 6 characters of the encoding vector name and compresses */
/* them into a 4 byte checksum using base 40 coding.  NOTE: 40^6 < 2^32    */
/* Treats lower case and upper case the same.                              */
/* Anything other than alphanumerics and -, &, _ is mapped to code 39.     */
/* Returns the checksum.  An empty vector name yields a zero checksum.     */
/* Names shorter than 6 characters are padded on the right with code 39.   */

#include <string.h>	/* for strcmp() and strcpy() */

unsigned long codefourty(char *vectorname) {
	unsigned long checksum=0;
	int c, k;

	if (strcmp(vectorname, "") == 0) return 0;

	for (k = 0; k < 6; k++) {
		if ((c = *vectorname) != '\0') vectorname++;
		if (c >= 'A' && c <= 'Z') c = c - 'A';
		else if (c >= 'a' && c <= 'z') c = c - 'a';
		else if (c >= '0' && c <= '9') c = (c - '0') + ('Z' - 'A') + 1;
		else if (c == '-') c = 36;		/* special case */
		else if (c == '&') c = 37;		/* special case */
		else if (c == '_') c = 38;		/* special case */
		else c = 39;				/* none of the above */
		checksum = checksum * 40 + c;
	}
	return checksum;
}

/* Decodes a 4 byte checksum to recover the first 6 characters of the    */
/* encoding vector name.  Writes the result into the second argument,    */
/* which must have room for at least 7 characters (including the '\0').  */
/* Returns zero if the checksum is zero, in which case the name is set   */
/* to "fixed"; this may be used for fonts with a fixed encoding, such as */
/* math fonts.  Returns 1 otherwise.                                     */

int decodefourty(unsigned long checksum, char *vectorname) {
	int c, k;

	if (checksum == 0) {
		strcpy(vectorname, "fixed");	/* font uses fixed encoding */
		return 0;
	}
	for (k = 0; k < 6; k++) {
		c = (int) (checksum % 40);
		checksum = checksum / 40;
		if (c <= 'z' - 'a' ) c = c + 'a';
		else if (c < 36) c = (c + '0') - ('z' - 'a') - 1;
		else if (c == 36) c = '-';		/* special case */
		else if (c == 37) c = '&';		/* special case */
		else if (c == 38) c = '_';		/* special case */
		else c = '.';		/* not alphanumeric or special case */
		vectorname[5-k] = (char) c;		/* right to left */
	}
	vectorname[6] = '\0';				/* null terminate */
	return 1;
}
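
/* A small test driver, not part of the proposal itself: it packs a few  */
/* of the encoding vector names listed above and decodes them again to   */
/* show the round trip.  Names longer than 6 characters are truncated,   */
/* shorter names come back padded with `.' characters, and the empty     */
/* name decodes to the placeholder "fixed".                              */

#include <stdio.h>

int main(void) {
	char *names[] = { "texnansi", "8r", "tex256", "" };
	char decoded[8];
	unsigned long checksum;
	int k;

	for (k = 0; k < 4; k++) {
		checksum = codefourty(names[k]);
		decodefourty(checksum, decoded);
		printf("%-10s  %10lu  %s\n", names[k], checksum, decoded);
	}
	return 0;
}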

/**************************************************************************/

Berthold K.P. Horn		mailto:bkph@ai.mit.edu
Cambridge, Massachusetts, USA