[Fontinst] Encoding inconsistency?

Hilmar Schlegel hschlegel at ubcom.de
Wed Dec 3 21:46:11 CET 2003



Lars Hellström wrote:
> 
> At 10.30 +0100 2003-12-03, Ulrich Dirr wrote:
> >Hi,
> >
> >while experimenting with OpenType fonts I came across some
> >inconsistencies in encoding files.
> 
> I'm not entirely surprised by this. The fontinst ETX and MTX files have a
> long history, and during that time some things have changed.
> 
> >I have used cm-super-t1.enc as I did not find a t1.enc (what is the
> >canonic t1 encoding file? Cork.enc?).
> 
> Good question (implies: I don't know the answer). There is a fair chance
> there may not be a canonical ENC file for T1.

Exactly. Typically one must make up an ENC for the specific font in
question to catch all the charstrings it provides (some folks might
prefer "glyphstrings" instead ;-)

One can cope with the common cases (e.g. when there are only two
"standards") by using fontinst names as meta names or by substituting
an alias.
In the worst case one would first need to set up an individual ENC for
each font's names.

> >When compared with t1.etx
> 
> A better object of comparison would be t1draft.etx in
> fontinst/doc/encspecs/. t1.etx just aims to produce useful fonts.

...

> 
> >/hyphen.alt   /hyphenchar
> >/Ng           /Eng          % Adobe Eng;014A
> >/Tcedilla     /Tcommaaccent % Adobe both Tcedilla;0162 Tcommaaccent;0162
> 
> This is an old headache, which also involves Scedilla. When T1 was designed
> (and for many years afterwards) it was believed that the comma accent (as
> used in e.g. Romanian and Latvian) was the same thing as the cedilla
> accent, and thus glyphs were named accordingly. Then it was "discovered"
> that they were different, and ever since then things have been confused.
> Generally, you can't trust the glyph name to accurately describe the glyph
> in this case.

Indeed (coming back to the worst case), Unicoding a font actually means
a specific codepage with lots of holes. The collision actually happens
at the OS level, through MS's definitions of codepages: there are two
characters, scommaaccent and scedilla, for one Unicode slot. Unicode has
meanwhile cured that by providing "traditional", "intermediate" and
"fixed" codes, both for the conflicting scommaaccent/scedilla and for
the non-conflicting tcommaaccent/tcedilla. In the latter case there is
no conflict because there is no such character as "tcedilla" (see Lars's
remark).
If you access fonts by silly Unicode indices instead of, more
intelligently, via character names, you are simply lost, since different
fonts (large TT, OT-TT and OT-cff) provide different "Unicodepages"
(i.e. versions of Unicodes) filled with more or less correct character
definitions.
Specifically, if you use the Adobe or Linotype OT fonts you can access
the correct characters via the current "fixed" Unicodes. However, if you
change the font to MS-TT fonts you will get nothing, since they
implement only the "traditional", i.e. wrong, codes (and moreover wrong
shapes). The alternative would be either nothing or the wrong shape of
tcommaaccent and a single available code for scedilla+scommaaccent
(obviously in 50% of the applications the shape is wrong too ;-).

Given this situation, Unicode is useful for document transport, since
from context one can well discriminate between the meanings of the
015e code. For a font, however, one must use either 015e/Scedilla or
0218/Scommaaccent.
You could use *both* 0162/"Tcedilla" and 021a/Tcommaaccent if the font
is made correctly (Linotype/Adobe). Since the wrong name is less
important than the wrong charstring, Adobe has decided to map both names
onto the same ("traditional") Unicode via ATM and Distiller.

Since in TeX we have the great opportunity to work on character names
instead of Unicodes, one should first see whether the font provides a
proper Tcommaaccent or uni021A, and encode these. If not, one must
decide whether to use a "Tcedilla", if one is available, in position
0162 so that at least something prints, or alternatively to construct a
T with comma below.
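
As a rough illustration of that order of preference, here is a minimal
Python sketch. The function name and the returned notes are made up for
this example and are not part of fontinst; the set of glyph names is
assumed to have been read from the font (e.g. from its AFM) beforehand.

```python
# Hypothetical helper: choose a glyph for the "T with comma below" slot.
# The preference order follows the paragraph above; nothing here is a
# fontinst command or API.
def pick_t_comma_glyph(font_glyphs):
    """Return (glyph_name, note) for the given set of glyph names."""
    # First choice: a proper comma-accent glyph under either name.
    for name in ("Tcommaaccent", "uni021A"):
        if name in font_glyphs:
            return name, "proper glyph"
    # Second choice: a "Tcedilla", so that at least something prints.
    if "Tcedilla" in font_glyphs:
        return "Tcedilla", "fallback, shape may be wrong"
    # Last resort: no glyph at all, a fake has to be constructed.
    return None, "construct T with comma below"


print(pick_t_comma_glyph({"T", "Tcedilla", "uni021A"}))
print(pick_t_comma_glyph({"T", "Tcedilla"}))
print(pick_t_comma_glyph({"T"}))
```

Whether the "Tcedilla" fallback or a constructed glyph is preferable
remains a per-font decision; the sketch only fixes the order in which
the names are tried.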

ATM/Distiller make Scommaaccent accessible via Unicode ("fixed") 0218
(uni0218) if the font has the charstring (e.g. CE-fonts).

How ATM & Distiller work on the *names* is defined in the AGL; what is
not covered there must be accessed via uniXXXX names to get at the
font-specific *codes*.
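
To make the distinction between names and codes concrete, here is a
small Python sketch of such a lookup. It is only an illustration under
stated assumptions: the table is a tiny excerpt holding just the
name/code pairs quoted in this thread, not the full AGL, the function
name is invented, and only the simple four-digit uniXXXX form is
handled.

```python
import re

# Excerpt only: the name;code pairs mentioned in this thread.
AGL_EXCERPT = {
    "Eng": 0x014A,
    "Scedilla": 0x015E,
    "Scommaaccent": 0x0218,
    "Tcedilla": 0x0162,      # both T names share the "traditional" code
    "Tcommaaccent": 0x0162,
}


def glyph_name_to_code(name):
    """Map a glyph name to a code: list lookup first, then uniXXXX."""
    if name in AGL_EXCERPT:
        return AGL_EXCERPT[name]
    # Simplification: accept exactly four uppercase hex digits after "uni".
    match = re.fullmatch(r"uni([0-9A-F]{4})", name)
    if match:
        return int(match.group(1), 16)
    return None  # name carries no code as far as this sketch can tell


for glyph in ("Tcedilla", "Tcommaaccent", "uni021A", "Scommaaccent"):
    print(glyph, format(glyph_name_to_code(glyph), "04X"))
```

With this table both Tcedilla and Tcommaaccent land on 0162, while the
"fixed" code 021A for the T is only reachable through the uni021A name.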

Conclusion: the inconsistency is actually in the fonts, and one can cope
with the situation in fontinst with font-specific encodings. In case the
fonts provide wrong charstrings, one must decide between PDFs from which
the text can be extracted/indexed/searched without "holes" and visually
acceptable virtual fakes constructed by fontinst.

> >/Germandbls   /SS           % not in Adobe's glyph list
> 
> Making SS a character in T1 was probably mostly due to a desire to support
> \uppercase, but it's a rather pointless character. Last time I looked, it
> wasn't even in Unicode.

Indeed, uppercase: ß -> SS.

Hilmar Schlegel
