[XeTeX] Unicode space characters

Tomáš Janoušek tomi at nomi.cz
Wed Mar 18 01:39:27 CET 2009


Hello Unicode TeXers,

I created a set of definitions for a few Unicode "space" characters and I
think these may be interesting to the community, and could possibly make it to
the distribution. I'd like to hear what you think about them and whether they
are correct (as I am not experienced in typography, and I have no idea about
non-European languages).

The motivation behind this:

I'm working on software for typesetting exams. The university information
system contains facilities for electronic examination, such that students are
presented with a set of randomly chosen questions and choices, which they
answer and receive a score. The questions are defined in a domain-specific
markup, with HTML allowed in the text. For obvious reasons, letting students
take exams at home is generally not a good idea, hence the exams may be
printed and answers scanned.

The software that typesets the exams has had very limited functionality, and
that's why I'm upgrading it. It generates a LaTeX source and processes it with
XeLaTeX. I added more sophisticated support for HTML markup and a couple of
other tweaks. It usually works, but since the tests are in most cases typeset
in two columns, line breaking may cause trouble.

I've decided to let the users control the line breaking of problematic words
and for this I found a couple of Unicode characters, so I just defined their
behaviour in TeX. These seem to work correctly, but I'm not sure I interpret
their meaning correctly, especially in the context of foreign languages and
scripts. While most exams are in Czech or English, I suppose that correct
typesetting of other languages may be much appreciated by foreign language
teachers, who haven't been able to use this software until now.

Here are the definitions:

> %% U+00A0 NO-BREAK SPACE;  
> %%   Unicode char for ~.
> \catcode`^^^^00a0=\active
> \def^^^^00a0{\nobreakspace}
> 
> %% U+00AD SOFT HYPHEN; ­
> %%   Unicode char for \-.
> \catcode`^^^^00ad=\active
> \let^^^^00ad=\-
> 
> %% U+200B ZERO WIDTH SPACE; ​
> %%   This character is intended for line break control; it has no width, but
> %% its presence between two characters does not prevent increased letter
> %% spacing in justification. (cited from Unicode)
> %%   In this interpretation, we use it to specify allowed line breaks (without
> %% hyphen) only. It should be easy to add some rubber space, if desired.
> \catcode`^^^^200b=\active
> \def^^^^200b{\hskip\z@skip}
> 
> %% U+200C ZERO WIDTH NON-JOINER; ‌
> %%   Ligature breaker.
> %%   It should be safe to assume that this is placed at the boundary of parts
> %% of a compound word, therefore we add a soft hyphen and break the word in
> %% two for the hyphenation algorithm.
> \catcode`^^^^200c=\active
> \def^^^^200c{^^^^00ad\nobreak\hskip\z@skip}
> 
> %% U+2060 WORD JOINER; ⁠
> %%   A zero width non-breaking space.
> \catcode`^^^^2060=\active
> \def^^^^2060{\nobreak\hskip\z@skip}
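
To try the definitions above in isolation, a minimal test document along the following lines may help. This is only a sketch under a few assumptions: it is compiled with xelatex, the narrow \parbox and the sample path are made up for illustration, and the characters are written in ^^^^ notation so that they stay visible in the source. Because \z@skip contains an @, the definitions have to sit between \makeatletter and \makeatother (or in a package file). The commented-out line shows one possible way to add the "rubber space" mentioned for U+200B; the amount of stretch is an arbitrary guess.

```latex
% Minimal sketch; compile with xelatex.
\documentclass{article}

\makeatletter % needed because \z@skip has an @ in its name
%% U+200B ZERO WIDTH SPACE: allowed line break without a hyphen.
\catcode`^^^^200b=\active
\def^^^^200b{\hskip\z@skip}
%% Possible variant with a little stretch (the value is an assumption):
% \def^^^^200b{\hskip 0pt plus 0.1em\relax}

%% U+2060 WORD JOINER: zero-width space with no break allowed.
\catcode`^^^^2060=\active
\def^^^^2060{\nobreak\hskip\z@skip}
\makeatother

\begin{document}
% A narrow box provokes line breaks; each ^^^^200b marks a place where
% the long path may be broken without inserting a hyphen.
\parbox{3cm}{\raggedright
  /usr/^^^^200bshare/^^^^200btexmf/^^^^200bfonts/^^^^200bopentype}
\end{document}
```

In real input the characters would of course be typed directly as U+200B and U+2060; TeX converts the ^^^^ notation to the same characters while reading the file, so both forms should behave identically.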

The last one is a bit more domain-specific, but may be funny or interesting,
so I include it anyway :-).

> %% U+0082 BREAK PERMITTED HERE; ‚
> %%   Follows a graphic character where a line break is permitted. Roughly
> %% equivalent to a soft hyphen except that the means for indicating a line
> %% break is not necessarily a hyphen. Not part of the first edition of ISO/IEC
> %% 6429. (cited from Wikipedia)
> %%   In this interpretation, we annotate the line-breaks using arrows on both
> %% lines (like in emacs) instead of a hyphen.
> %%   To be used in verbatim blocks, like <code> or <tt>.
> \catcode`^^^^0082=\active
> \def^^^^0082{\discretionary{\copy\odp@BPHa}{\copy\odp@BPHb}{}}
> % Prepare the marks. This code needs to be adjusted if the main font is
> % changed.
> \newbox\odp@BPHa\newbox\odp@BPHb
> \begingroup\setbox0\hbox{^^^^2935}\setbox1\hbox{^^^^2937}
> \global\setbox\odp@BPHa\hbox{\lower.56ex\copy0\kern-\wd0}
> \global\setbox\odp@BPHb\hbox{\kern-\wd1\lower.02ex\copy1}
> \endgroup\dp\odp@BPHa=0pt

(the font used for the arrows is DejaVu Serif in my case)
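
Since the marks have to be rebuilt whenever the main font changes, one option might be to wrap the preparation code in a macro that can be called again later. The sketch below is meant to replace the \newbox and \setbox lines above, not to be added to them. The name \odp@BPHsetup is hypothetical, and the .56ex and .02ex offsets are simply the values tuned for DejaVu Serif, so they would probably need adjusting for another font.

```latex
\makeatletter
\newbox\odp@BPHa
\newbox\odp@BPHb
% Hypothetical helper, not part of the code above: rebuild both marks
% using whatever font is current at the time it is called.
\newcommand*\odp@BPHsetup{%
  \begingroup
    \setbox0\hbox{^^^^2935}% arrow shown at the end of the broken line
    \setbox1\hbox{^^^^2937}% arrow shown at the start of the next line
    \global\setbox\odp@BPHa\hbox{\lower.56ex\copy0\kern-\wd0}%
    \global\setbox\odp@BPHb\hbox{\kern-\wd1\lower.02ex\copy1}%
  \endgroup
  \dp\odp@BPHa=0pt\relax
}
% Build the marks once the document font is set up.
\AtBeginDocument{\odp@BPHsetup}
\makeatother
```

Calling the helper again after a font switch (from code where @ is a letter) should refresh the boxes, because the definition of ^^^^0082 only copies them at the moment a break is typeset.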

Regards,
-- 
Tomáš Janoušek, a.k.a. Liskni_si, http://work.lisk.in/
