[texhax] The details of \csname, in this specific case
doug at mathemaesthetics.com
Sat Feb 23 19:54:38 CET 2013
Patrick Rutkowski wrote -
>So, I found a nifty little hack online, and I adapted it such that I
>can type my TeX sources in UTF8 and have macrons come out correctly.
>The code is pasted toward the bottom of this message.
>Before I get to my question, I should first stem a few obvious
>comments: Yes, I know XeTeX exists. But, I really like xdvi. For some
>reason xdvi finds XeTeX's dvi output unworkable, and so I'm sticking
>with straight up TeX, so I can keep using xdvi.
I think this means that you will have to handle UTF-8 one byte at a
time: standard TeX's internals manipulate indivisible characters only as
8-bit quantities, so any code point above 127 reaches TeX as a sequence
of two or more bytes, each in the range 128--255.
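To see the byte-at-a-time view concretely, here is a minimal plain TeX
sketch. \showbytes is a made-up helper name, and ^^c4^^81 stands in for
the two raw bytes that a UTF-8 editor would write for a-macron (U+0101);
I am assuming an 8-bit engine such as Knuth's tex or pdftex.

```tex
% \showbytes is a hypothetical helper: it grabs the next two tokens
% and reports their character codes on the terminal and in the log.
\def\showbytes#1#2{\message{[\number`#1, \number`#2]}}
\showbytes^^c4^^81   % reports [196, 129]: two separate 8-bit characters
\bye
```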
>Now, onto my actual questions. The below TeX code works, but I don't
>exactly know how. I understand what the \catcode is doing, and I
>understand what \expandafter is doing. Naturally, I'm also very familiar
>with how UTF-8 internals work, with variable length sequences and all
>that good stuff. What I don't quite understand is what is inside of
>1) I would have expected to have to encode c4 and c5 as something like
>`^^c4 and `^^c5 inside of the \csname, but somehow that is not
>2) I thought that \csname took only "character tokens," but wouldn't
>something like "c4" be two separate character tokens, first "c" and
>3) Moreover, what is that colon doing in between the c4 and the #1?
See below. It's minimizing the risk of macro name conflict, and making
things a bit easier to read.
>4) How exactly does TeX come to interpret the #1 as a "character
>token," aren't things above value 127 by default labeled "invalid?"
I think only ASCII 127 (DEL) is labeled invalid. Post-7-bit-TeX, bytes
above 127 in the input stream are usually classified as "other"
characters (as opposed to letters), so that they pass on through to being
placed in the layout one at a time without affecting the interpreter's
>5) And finally, why exactly is the single quote needed before the ^^c4
>for the \catcode, but not in the \def?
The reverse apostrophe ` is part of TeX's syntax for numbers rather than
of \catcode itself: \catcode wants a number in the range 0-255, and a `
followed by a character token means "the code of that character". If
the ` is instead followed by a macro whose name is exactly 1 character
long (in which case it doesn't have to be a letter character), TeX uses
that macro name's single character (after the backslash) as the byte it
was looking for, without expanding the macro. (How the name is recorded
internally is a detail not really relevant here.)
The 1-character macro name is a common TeX idiom: it gives the right
answer whatever category code the character currently has, even when the
bare byte would be treated as a comment character, ignored, or invalid.
The \def needs no ` because it is not asking for a number; it is given
the (by then active) character itself as the thing to be defined.
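A small plain TeX sketch of the two forms of character constant, for
comparison. This is my own illustration, not the quoted code; it assumes
the \catcode line there has the `\^^c4 form, and \count255 is used only
as a scratch register.

```tex
% After the backquote TeX accepts either a character token or a
% single-character control sequence; neither is expanded.
\count255=`A        % character token
\message{[\the\count255]}       % reports [65]
\count255=`\A       % single-character control sequence, same value
\message{[\the\count255]}       % reports [65]
\count255=`\%       % works although a bare % would start a comment
\message{[\the\count255]}       % reports [37]
\catcode`\^^c4=13   % byte "C4 becomes an active character
\message{[\the\catcode`\^^c4]}  % reports [13]
\bye
```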
So, after all that rigmarole, the bytes ^^c4 and ^^c5 are successfully
given category code 13, which makes them active characters. The purpose
is so that a macro can be fired the moment the scanner encounters one of
these two bytes (a ^^c4 or a ^^c5) on the input. Those two
active-character-bound macro definitions follow. They are only
temporary macros in the service of defining a bunch of other macros.
When the ^^c4 byte is encountered in the input, TeX collects the two
arguments that follow it. The first argument #1 is expected to be a
single byte that is the second byte of the two-byte UTF-8 sequence (TeX
just sees it as another input character, likely of type "other").
The second argument #2 is collected as a brace-enclosed set of tokens (or
it could be another single byte character, but the usage below is the
When the macro bound to the active character ^^c4 is executed, it itself
defines a new macro. But that new macro's 4-character name is
non-standard (it has non-letters in it, including a colon). The four
characters are all the bytes between \csname and \endcsname; \csname
accepts character tokens of any category and strings them together to
form the entire name. So if argument #1 were matched to character Z,
the new macro's name would be "c4:Z" (without quotes, and eliding the
backslash). Typically, though, argument #1 will be a legal second/final
UTF-8 byte.
The new macro so defined when ^^c4's active character macro is fired
takes no arguments. Its brace-enclosed replacement text is filled in
with whatever was collected into #2 by the active character macro that
is doing the defining.
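Since the paste itself is not reproduced here, the following is only my
reconstruction of what such a defining macro might look like in plain
TeX, not the original code. The two example mappings are my own choice,
and I write the bytes in ^^ notation where a UTF-8 source file would
contain them raw.

```tex
% Hypothetical reconstruction of the definition phase.
\catcode`\^^c4=13
% While in this temporary role, ^^c4 takes the continuation byte (#1)
% and a replacement text (#2), and defines the macro named "c4:<byte>".
% \expandafter makes \csname build the name before \def sees it.
\def^^c4#1#2{\expandafter\def\csname c4:#1\endcsname{#2}}
^^c4^^81{\=a}   % UTF-8 C4 81 is U+0101, a with macron
^^c4^^ab{\=\i}  % UTF-8 C4 AB is U+012B, dotless i with macron
% Inspect one of the oddly named macros; reports macro:->\=a
\message{\expandafter\meaning\csname c4:^^81\endcsname}
\bye
```

The name "c4:^^81" can never be typed as \c4:... in ordinary input,
which is why \csname is needed both to define it and to call it.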
After these definitions come the invocations that, when executed, leave
behind a macro mapping each pair of UTF-8 bytes encountered sequentially
on the input to an individual TeX glyph/character or to the name of a
character code (like \i).
Then, after those special mapping definitions are made and executed, the
two active character macros are redefined to simply invoke the various
mapping macros previously defined, without causing a redefinition of
anything. The scanner never sees any macro names with colons in them;
it only sees active characters, which look up the predefined macros with
weird names and call them to substitute the final macron characters.
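A sketch of that final lookup stage in plain TeX, again my own guess at
the shape of the code rather than the original. To keep it
self-contained, the one mapping it uses is defined directly instead of
through the temporary defining macro.

```tex
% Hypothetical sketch of the lookup phase.
\catcode`\^^c4=13
\expandafter\def\csname c4:^^81\endcsname{\=a}
% Final definition: one argument (the continuation byte), pure lookup.
\def^^c4#1{\csname c4:#1\endcsname}
am^^c4^^81re   % same as typing am\=are
am^^c4^^82re   % no mapping for "82: \csname gives \relax, so just "amre"
\bye
```

One consequence of doing the lookup with \csname is that a byte pair
with no mapping raises no error: the undefined name silently becomes
\relax and the character simply goes missing from the output.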
It's all a clever, egregious hack, of course, but it works for 2-byte
UTF-8 sequences, or at least a subset of them.
Anyway, that's what I can glean from your quoted code; I've not used it
>I've tried to scour the TeXbook for these answers, but I've come up
>=== [ BEGIN PASTE ] ===
>=== [ END PASTE ] ===