[pdftex] OT: Unicode and typesetting

Mon Apr 4 08:04:46 CEST 2005

I know Unicode is 'off topic' for this list, except for its use as an input 
encoding (and this message is not about that, specifically).

What I do want to ask for is the views of expert typesetters on whether 
Unicode measures up to its claims (or their expectations).

I have recently been typesetting a Japanese legal text and the following 
'features' have confused me.

Much is made of the fact that "the standard defines how characters are 
interpreted, and not how glyphs are rendered"[ref.1]
A common example is the pair of 'round' (as in Adobe 'built in' AvantGarde) 
and 'open' (as in built in Helvetica) lowercase a's. Yet many characters are 
still duplicated.

1.	Character x3000 is a Japanese (really CJK) space. Our old 
friend x20 is the normal ASCII space.
The reason for having an ideographic space seems to be the need for a fixed 
width space (the same width as one ideograph). But surely that is a glyph 
issue?
Courier needs fixed width spaces .... , numbers in many (?all) fonts are 
fixed width (so accounting tables align) ...

There are lots more examples (fixed width brackets, to name but one other).
To me these seem to be locale issues (and solvable (_if_ they are a problem, 
at all) by declaring the language used, e.g. <div xml:lang="ja">FIXED WIDTH 
(JAPANESE) <span xml:lang="en">(the language of Japan, noted in Latin 
characters)</span> CHARACTERS</div>, or whatever.)
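
As a rough illustration of how the standard itself records these fixed-width 
duplicates, here is a minimal Python sketch using only the standard library 
module unicodedata. The helper name describe() is my own invention for this 
example. It assumes that compatibility normalization (NFKC) is an acceptable 
way to fold the CJK forms onto their ASCII counterparts, which may not suit 
every application.

```python
import unicodedata

IDEOGRAPHIC_SPACE = "\u3000"     # the CJK space from point 1
FULLWIDTH_LEFT_PAREN = "\uff08"  # one of the fixed-width brackets


def describe(ch):
    """Return (name, East Asian width class, NFKC-folded form) for one character."""
    return (
        unicodedata.name(ch),
        unicodedata.east_asian_width(ch),   # 'F' = fullwidth, 'Na' = narrow
        unicodedata.normalize("NFKC", ch),  # compatibility folding
    )


for ch in (" ", IDEOGRAPHIC_SPACE, "(", FULLWIDTH_LEFT_PAREN):
    print(f"U+{ord(ch):04X}", describe(ch))
```

In the character database the ideographic space and the fullwidth bracket 
carry a width property and a compatibility mapping back to x20 and x28, so 
the standard does seem to treat them as width variants, albeit ones with 
their own code points.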

2.	This duplication of characters becomes even stranger with 
'foreign type glyphs that ASCII users might like', e.g.:

The Angstrom sign has its own code point (x212B), different to 'A with ring 
above' (xC5, which is &Aring; in HTML speak, I think), let alone the fact that 
you can 'build your own' glyphs: x41 x30A.
That there is an Angstrom sign code point (and a degrees Celsius (x2103) and 
degrees Fahrenheit (x2109)) is a boon for text searching. One can find all 
the measurements in Angstroms in a text, even if that text is in a language 
that uses circles on top of vowels.
But being able to search for kilometres (let alone metres: 'm') would be 
equally (if not more) useful. 

It is not even as if x212B is some kind of symbolic link to xC5 for legacy 
purposes. There are two distinct code points.
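
For what it is worth, the standard's normalization forms do appear to tie 
these code points together after the fact. The following is a minimal Python 
sketch using the standard library module unicodedata. The helper name 
same_after_nfc() is hypothetical, and it assumes that the searching software 
normalizes both the text and the query, which many tools do not do.

```python
import unicodedata

ANGSTROM_SIGN = "\u212b"
A_WITH_RING = "\u00c5"
A_PLUS_COMBINING_RING = "\u0041\u030a"  # 'build your own': A + combining ring


def same_after_nfc(*strings):
    """True if all the strings become identical under NFC normalization."""
    return len({unicodedata.normalize("NFC", s) for s in strings}) == 1


# The raw code points differ ...
print(ANGSTROM_SIGN == A_WITH_RING)
# ... but the normalized forms can be compared directly.
print(same_after_nfc(ANGSTROM_SIGN, A_WITH_RING, A_PLUS_COMBINING_RING))
```

The price is that, once normalized, a search can no longer tell an Angstrom 
measurement from an ordinary 'A with ring above'.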

3.	There is also a set of Roman numerals. Thus VII (x2166) and 
vii (x2176) exist.
Again, indexing is not the main issue behind code points or glyphs (though it 
does have its uses!), but this would be useful for searching for the seventh 
article of an international convention. One could even build a synonym 
database where 7 (x37) is mapped to VII and vii (or vice versa).

This, though, only highlights the fact that there are no code points for 
'(g)', '(7)', '7.', etc. So this really is a cul-de-sac. Indeed, when one gets 
into the detail one again discovers that the Roman numerals are really 
fixed-width ones to go with ideographs; for Latin text you are meant to use 
Latin alphabet letters (I think).
This leads to the potentially bizarre result that an 'intelligent' search 
algorithm would find numeral 7 (x37), CJK fixed width 7 (xFF17), real 
Japanese 7 (x4E03), and CJK Roman VII's (x2166 and x2176) but not Latin 
script VII or vii !!
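
To make the search problem concrete, here is a small Python sketch that 
queries the numeric property and the compatibility folding of each 'seven' 
mentioned above, again using only the standard library module unicodedata. 
The names SEVENS, numeric_value() and fold() are made up for this example, 
and I am assuming a search tool that could consult this data; it is not a 
description of how any real search engine works.

```python
import unicodedata

# The 'sevens' from the paragraph above (labels are informal).
SEVENS = {
    "ASCII digit": "\u0037",
    "CJK fixed width digit": "\uff17",
    "Japanese numeral": "\u4e03",
    "Roman numeral VII": "\u2166",
    "small Roman numeral vii": "\u2176",
}


def numeric_value(ch):
    """Numeric property of a single character, or None if it has none."""
    try:
        return unicodedata.numeric(ch)
    except ValueError:
        return None


def fold(text):
    """Compatibility-fold text (NFKC) so width and numeral variants collapse."""
    return unicodedata.normalize("NFKC", text)


for label, ch in SEVENS.items():
    print(f"U+{ord(ch):04X} {label}: value={numeric_value(ch)} folded={fold(ch)!r}")

# Plain Latin letters carry no numeric property at all.
print(numeric_value("V"))
```

Folding does turn x2166 into the Latin letters VII, so a search that folds 
both text and query would match Latin script VII after all. The numeric 
property, on the other hand, never sees a VII that was typed as three 
ordinary letters.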

So, Unicode is definitely a big advance on all those dozens of (often 
corrupted by M$) code tables, but is it really a set of 'characters' which 
leaves glyph selection/representation to the rendering engine, or is it a 
peculiar mixture of characters and glyphs that may not only make many 
problems for the future interpretation of electronic files, but also is not 
that easy to intelligently use (e.g. searching) or even render as glyphs?

How do others feel?

	Michael Chapman.

[ref.1] "The difference between identifying a code value and rendering it on 
screen or paper is crucial to understanding the Unicode Standard's role in 
text processing. ..." 
"... the standard defines how characters are interpreted, and not how glyphs 
are rendered ..." The Unicode Standard, Version 3.0, April 2000, page 5.

This issue is discussed further on page 298, which says use of many of the 
_given_ code points is "strongly discouraged" ... (?).