[tex4ht] [bug #241] grave accent letter ` (hex 60) changes to left single quotation mark (hex 0xE2 0x80 0x98)

Sat Jan 17 18:33:46 CET 2015

Hi Karl,

> Meanwhile, aren't there options at the tex4ht level to decide whether to
> generate "unicode" (e.g., the unicode directed left quote) or not?
> I confess I have never had a good grasp on, or seen a comprehensible
> description of, all the multifarious options that Eitan created.  Aside
> from what you have written on your blog, and I fear I haven't even
> internalized those.

My understanding of the process is that for each tfm or vf file, the
tex4ht post-processor searches for the corresponding .htf file. The
.htf file provides an ASCII code or a hexadecimal Unicode code point
for each character in the font. These codes are then translated,
using a .4hf file, into the characters written to the output file.

The structure of .htf files is described here:
http://www.tug.org/applications/tex4ht/mn-htf.html#index23-63001

example line:

'&#x02C6;'    ''        2
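
To make the layout of such a line concrete, here is a minimal sketch of how it could be split. It assumes my reading of the page linked above: the first character of the entry is the delimiter, the first delimited field is the output string, the second is the class, and whatever follows is a comment. `parse_htf_line` is a hypothetical helper for illustration, not something tex4ht provides.

```python
def parse_htf_line(line):
    """Split one .htf entry into (string, class, comment).

    Assumption: the entry starts with its delimiter character and
    contains two delimited fields followed by a free-form comment.
    """
    line = line.strip()
    if not line:
        raise ValueError("empty line")
    delim = line[0]                 # whatever character opens the entry
    parts = line.split(delim)
    # Expected shape: ['', string, gap, class, rest...]
    if len(parts) < 5:
        raise ValueError("not a two-field .htf entry: %r" % line)
    string = parts[1]
    char_class = parts[3]
    comment = delim.join(parts[4:]).strip()
    return string, char_class, comment


if __name__ == "__main__":
    print(parse_htf_line("'&#x02C6;'    ''        2"))
```

Because the delimiter is taken from the line itself, the same function works for files that use a different delimiter character.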

Which .4hf file is used in the translation is determined by the
`-c` command line option of tex4ht. This option selects a section in
the .env file, so when we use `-cunihtf` for unicode output, this
section is selected:

<unihtf>
i~/tex4ht.dir/texmf/tex4ht/ht-fonts/unicode/!
i~/tex4ht.dir/texmf/tex4ht/ht-fonts/ascii/!
i~/tex4ht.dir/texmf/tex4ht/ht-fonts/alias/!
</unihtf>
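
As a rough illustration of what selecting a section means, the sketch below pulls the search directories out of a named section of the .env text. It assumes that lines starting with `i` name directories and that a trailing `!` asks for a recursive search, which is how I read the file. `htf_search_dirs` is a hypothetical helper, not tex4ht's own code.

```python
import re


def htf_search_dirs(env_text, section):
    """Return (directory, recursive) pairs from the `i` lines of a section."""
    name = re.escape(section)
    match = re.search(r"<%s>(.*?)</%s>" % (name, name), env_text, re.S)
    if match is None:
        return []
    dirs = []
    for line in match.group(1).splitlines():
        if line.startswith("i"):
            path = line[1:]
            # Assumption: a trailing "!" means "search subdirectories too".
            dirs.append((path.rstrip("!"), path.endswith("!")))
    return dirs


SAMPLE_ENV = """<unihtf>
i~/tex4ht.dir/texmf/tex4ht/ht-fonts/unicode/!
i~/tex4ht.dir/texmf/tex4ht/ht-fonts/ascii/!
i~/tex4ht.dir/texmf/tex4ht/ht-fonts/alias/!
</unihtf>
"""

if __name__ == "__main__":
    for directory, recursive in htf_search_dirs(SAMPLE_ENV, "unihtf"):
        print(directory, recursive)
```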

so the .4hf files in these directories are used (they seem to always
be named unicode.4hf and saved in a charset subdir). Because the .4hf
file referenced in the `unihtf` section doesn't contain many
characters, the majority of accents are output as HTML entities, as
they were provided in the .htf files.

When we add the `-utf8` option for tex4ht, I think tex4ht translates
the Unicode entities directly to Unicode characters.
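
The effect can be illustrated with a few lines of Python. This only shows the conversion from a hexadecimal entity to UTF-8 bytes, it is not how tex4ht implements it, and `entities_to_utf8` is a name made up for the example. The second call reproduces the three bytes from the subject of this bug report.

```python
import re

# Matches hexadecimal character references such as &#x02C6;
ENTITY = re.compile(r"&#x([0-9A-Fa-f]+);")


def entities_to_utf8(text):
    """Replace &#xHHHH; references by the character and encode as UTF-8."""
    decoded = ENTITY.sub(lambda m: chr(int(m.group(1), 16)), text)
    return decoded.encode("utf-8")


if __name__ == "__main__":
    print(entities_to_utf8("&#x02C6;"))   # the entity from the .htf example
    print(entities_to_utf8("&#x2018;"))   # left single quotation mark
```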

btw, I think Nasser has found many errors in .htf files in the last
two weeks, and for many fonts the .htf files are missing altogether.
So I started investigating whether it is possible to get Unicode code
points for the characters in fonts.

My idea is the following: we can take the property list of a tfm file
and find the PostScript name of each character in the corresponding
.enc file. We can then get the Unicode code point for that PostScript
name from the glyphlist.txt and texglyphlist.txt files included in
TeX distributions.
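
A minimal sketch of that lookup, under some assumptions: the glyph list uses `name;HEX` lines with `#` comments (several code points separated by spaces, and I take only the first alternative if there is a comma), and the .enc file is a PostScript vector of `/name` tokens between `[` and `]` with `%` comments. The function names are hypothetical and the sample data is made up for the example. This is not the code from my scripts.

```python
import re


def parse_glyphlist(text):
    """Map glyph name -> list of code points from glyphlist-style text."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, codes = line.partition(";")
        # Assumption: on "AAAA,BBBB" keep only the first alternative.
        first = codes.split(",")[0]
        table[name] = [int(code, 16) for code in first.split()]
    return table


def parse_enc(text):
    """Return glyph names of an encoding vector, indexed by slot."""
    text = re.sub(r"%.*", "", text)          # strip PostScript comments
    start = text.index("[")
    end = text.index("]", start)
    return re.findall(r"/([^\s/\[\]]+)", text[start + 1:end])


def slots_to_unicode(enc_text, glyphlist_text):
    """Map slot number -> code points, skipping names not in the list."""
    glyphs = parse_glyphlist(glyphlist_text)
    names = parse_enc(enc_text)
    return {slot: glyphs[name]
            for slot, name in enumerate(names) if name in glyphs}


SAMPLE_ENC = """% made-up three-slot encoding
/SampleEncoding [
  /.notdef /circumflex /A   % slots 0, 1, 2
] def
"""
SAMPLE_GLYPHLIST = "# comment\nA;0041\ncircumflex;02C6\n"

if __name__ == "__main__":
    print(slots_to_unicode(SAMPLE_ENC, SAMPLE_GLYPHLIST))
```

Slots whose glyph name is missing from the list (here `.notdef`) are simply left out, so they can be reported and handled by hand.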

I have found two obstacles:

1. Virtual fonts, which reference many other fonts, including other
virtual fonts. This in itself is not a problem, since we can load all
the needed files. But sometimes two or more glyphs are used to create
one character (mainly accents), so we can't get the PostScript name of
such a character even if we know the encoding of the referenced glyphs.

2. I have found many tfm files which declare a custom encoding, but I
can't find .enc files for such encodings.

For example, when I list the fonts used in the `ntxmia` virtual font:

ntxmia=FONTSPECIFIC
txmia=FONTSPECIFIC
txsyc=FONTSPECIFIC
txr=TEX TEXT
ntxexb=UNSPECIFIED
rtxmio=FONTSPECIFIC
ntxsyralt=NTXMIAALTENCODING
txsyb=FONTSPECIFIC + MSBMENCODING
ptmr8r=TEXBASE1ENCODING
zxxrl7z=ADOBESTANDARDENCODING

I can find TEXBASE1ENCODING, but for the FONTSPECIFIC ones I have to
use Google to find out which encoding is actually used, and I don't
always find anything. With afm2pl -V I can get a property list with
glyph names in comments, but these glyph names are not always useful.
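
To show the kind of triage this leads to, here is a small sketch that reads the coding scheme from a property list and sorts a `font=SCHEME` listing into fonts with a known .enc file and fonts that need manual work. The table of known encodings is an assumption and lists only the one mapping I am sure of (TeXBase1Encoding lives in 8r.enc). All function names are hypothetical.

```python
import re

# Assumption: hand-maintained table, deliberately minimal.
KNOWN_ENC_FILES = {"TEXBASE1ENCODING": "8r.enc"}


def coding_scheme(pl_text):
    """Return the CODINGSCHEME value of a property list, or None."""
    match = re.search(r"\(CODINGSCHEME\s+([^)]*)\)", pl_text)
    return match.group(1).strip() if match else None


def triage(listing):
    """Split 'font=SCHEME' lines into ({font: enc_file}, [unresolved fonts])."""
    known, unresolved = {}, []
    for line in listing.splitlines():
        font, sep, scheme = line.strip().partition("=")
        if not sep:
            continue                      # skip blank or malformed lines
        enc_file = KNOWN_ENC_FILES.get(scheme.strip())
        if enc_file:
            known[font] = enc_file
        else:
            unresolved.append(font)
    return known, unresolved


SAMPLE_LISTING = """ntxmia=FONTSPECIFIC
txr=TEX TEXT
ntxexb=UNSPECIFIED
ptmr8r=TEXBASE1ENCODING
"""

if __name__ == "__main__":
    print(coding_scheme("(FAMILY TXR)\n(CODINGSCHEME TEX TEXT)\n"))
    print(triage(SAMPLE_LISTING))
```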

To sum it up, I am trying to write Lua scripts that can generate a
.htf file for each virtual or normal font, but I am not sure if it's
even possible :)

https://github.com/michal-h21/htfgen

best regards,

Michal