[texhax] The details of \csname, in this specific case

Sat Feb 23 19:54:38 CET 2013

Patrick Rutkowski wrote -

>So, I found a nifty little hack online, and I adapted it such that I
>can type my TeX sources in UTF8 and have macrons come out correctly.
>The code is pasted toward the bottom of this message.
>
>Before I get to my question, I should first stem a few obvious
>comments: Yes, I know XeTeX exists. But, I really like xdvi. For some
>reason xdvi finds XeTeX's dvi output unworkable, and so I'm sticking
>with straight up TeX, so I can keep using xdvi.

I think this means each byte of a multi-byte UTF-8 sequence will reach 
TeX as a separate character, because standard TeX's internals manipulate 
indivisible characters only as 8-bit quantities.  Any Unicode code point 
above 127 therefore has to be reassembled from its bytes by macros.

>Now, onto my actual questions. The below TeX code works, but I don't
>exactly know how. I understand what the \catcode is doing, and I
>understand \expandafter is doing. Naturally, I'm also very familiar
>with how UTF-8 internals work, with variable length sequences and all
>that good stuff. What I don't quite understand is what is inside of
>the \csname.
>
>1) I would have expected to have to encode c4 and c5 as something like
>`^^c4 and `^^c5 inside of the \csname, but somehow that is not
>required.
>
>2) I thought that \csname took only "character tokens," but wouldn't
>something like "c4" be two separate character tokens, first "c" and
>then "4"?

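Correct.  \csname accepts any sequence of character tokens, whatever 
their category, so the "c4" there is simply the two characters c and 4 
used as part of a name, not a byte value.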

>3) Moreover, what is that colon doing in between the c4 and the #1?

See below.  The colon is simply one more character in the macro's name.  
It minimizes the risk of a macro name conflict, and makes things a bit 
easier to read.

>4) How exactly does TeX come to interpret the #1 as a "character
>token," aren't things above value 127 by default labeled "invalid?"

I think only ASCII 127 (DEL) is labeled invalid.  Since TeX went 8-bit 
(version 3), bytes above 127 in the input stream are by default 
classified as "other" characters (category 12, as opposed to letters), 
so each one is passed through and typeset individually without 
triggering any special behavior in the interpreter.
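
A quick way to check those category codes yourself, as a sketch assuming 
a stock plain TeX format with no encoding-specific setup loaded (the 
messages go to the terminal and the log):

=== [ BEGIN PASTE ] ===
% Category code of byte 127 (DEL); plain TeX sets this to 15 (invalid).
\message{catcode of 127: \the\catcode127}
% Category code of byte "C4 (decimal 196); expected to be 12 (other).
\message{catcode of "C4: \the\catcode"C4}
\bye
=== [ END PASTE ] ===

If a format or an input-encoding package has already touched these 
bytes, the second value may differ.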

>5) And finally, why exactly is the single quote needed before the ^^c4
>for the \catcode, but not in the \def?

The \catcode primitive expects a number (a character code in the range 
0-255) as its operand.  The reverse apostrophe ` tells TeX's number 
scanner to take that number from the next token, without expanding it.  
That token can be a single character, or a macro whose name is exactly 
1 character long (in which case it doesn't have to be a letter 
character), and then the single character after the backslash supplies 
the code.  The ^^c4 is just TeX's ASCII notation for the byte with hex 
value c4.  The 1-character macro name form is the usual TeX idiom 
because it works whatever category code the character currently has, 
including characters such as % that could not be written bare.  In the 
\def, by contrast, no number is wanted: the active character ^^c4 is 
itself the token being defined, so no ` is needed.
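
For what it's worth, the backquote notation can be tried out in 
isolation.  The sketch below assumes plain TeX and uses \count255 purely 
as a scratch register:

=== [ BEGIN PASTE ] ===
% `\^^c4 yields the number 196, whatever category byte ^^c4 has.
\count255=`\^^c4
\message{value: \the\count255}
% The same form names awkward characters such as the comment character.
\count255=`\%
\message{value: \the\count255}
\bye
=== [ END PASTE ] ===

The first message should report 196 (hex c4) and the second 37, the 
ASCII code of the percent sign.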

So, after all that rigmarole, the two \catcode lines succeed in making 
the bytes ^^c4 and ^^c5 active characters (category 13).  The purpose is 
so that a macro can be fired the moment the scanner encounters one of 
these two bytes (a ^^c4 or a ^^c5) on the input.  The two 
active-character macro definitions follow.  They are only temporary 
macros in the service of defining a bunch of other macros.

What TeX does when the ^^c4 byte is encountered in the input is collect 
the two arguments that follow.  The first argument #1 is expected to be 
a single byte that is the second byte of the two-byte UTF-8 sequence 
(TeX just sees it as another input character, likely of type "other").  
The second argument #2 is collected as a brace-enclosed set of tokens 
(or it could be another single byte character, but the usage below is 
the former case).
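
The argument grabbing can be seen on its own with an ordinary macro.  
In this sketch \pair is a made-up name used only for illustration; 
plain TeX is assumed:

=== [ BEGIN PASTE ] ===
% \pair is a hypothetical macro taking two undelimited arguments.
\def\pair#1#2{\message{[first: #1] [second: #2]}}
\pair x{abc}% #1 is the single token x, #2 is abc (braces stripped)
\pair xy% with no braces, #2 is just the single token y
\bye
=== [ END PASTE ] ===

The active character ^^c4 picks up its second UTF-8 byte and the braced 
accent in the same way that \pair picks up x and abc here.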

When the macro representing active character ^^c4 is executed, it itself 
defines a new macro.  But that new macro's 4-character name is 
non-standard (it has non-letters in it, including a colon).  The four 
characters are all the bytes between \csname and \endcsname, but treated 
as if they were letters to form the entire name.  So if argument #1 were 
matched to character Z, the new macro's name would be "c4:Z" (without 
quotes, and eliding the backslash).  Typically, though, argument #1 will 
be a legal second/final UTF-8 byte.

The new macro so defined when ^^c4's active character macro is fired 
takes no arguments.  Its name is followed by a brace-enclosed 
substitution body, which is filled in with whatever was collected into 
#2 by the original active character macro being executed to define the 
new macro with the weird name.
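
To see such a name being built and used outside the hack, here is a 
small sketch for plain TeX.  It uses the name c4:Z from the example 
above and gives it \=a as a body, which is an arbitrary choice for 
illustration:

=== [ BEGIN PASTE ] ===
% Build a control sequence whose name is the four characters c4:Z
% and give it a replacement text, as the first ^^c4 macro does.
\expandafter\def\csname c4:Z\endcsname{\=a}
% Inspect the result.  The name cannot be typed directly as \c4:Z,
% because the scanner would end the name after the letter c.
\message{\expandafter\meaning\csname c4:Z\endcsname}
\csname c4:Z\endcsname % typesets an a with a macron
\bye
=== [ END PASTE ] ===

The \expandafter is what makes \def see the finished control sequence 
rather than the \csname token itself.  The message should describe a 
macro with no parameters whose body is \=a.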

After these definitions come ten invocations of the two active-character 
macros.  Each one, when executed, leaves behind a macro that maps a pair 
of UTF-8 bytes encountered sequentially on the input to a TeX accent 
construction built from individual glyphs/characters or names of 
character codes (like \i).

Then, after those special mapping definitions are made and executed, the 
two active character macros are redefined to take a single argument and 
simply invoke the various mapping macros previously defined, without 
causing a redefinition of anything.  The scanner never sees any macro 
names with colons in them; it only sees active characters, which build 
the weird names with \csname and call those predefined macros to 
substitute the final macron characters.
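
Since 8-bit bytes tend to get mangled in mail, here is an ASCII-only 
sketch of both phases for a single character.  It assumes plain TeX and 
writes the two UTF-8 bytes of a-macron (c4 81, i.e. U+0101) in ^^ 
notation; it is a reduction of the quoted code, not a tested 
replacement for it:

=== [ BEGIN PASTE ] ===
\catcode`\^^c4=13
% Setup phase: ^^c4 takes the second byte and a body, defines a macro.
\def^^c4#1#2{\expandafter\def\csname c4:#1\endcsname{#2}}
^^c4^^81{\=a}% defines the macro whose name is c4: plus byte 81
% Lookup phase: ^^c4 takes only the second byte and calls that macro.
\def^^c4#1{\csname c4:#1\endcsname}
^^c4^^81 % now expands to \=a
\bye
=== [ END PASTE ] ===

One thing to be aware of: if the second byte has no mapping defined, 
\csname quietly makes that name equivalent to \relax, so an unmapped 
character simply vanishes instead of raising an error.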

It's all a clever, egregious hack, of course, but it works for 2-byte 
UTF-8 sequences, or at least a subset of them.

Anyway, that's what I can glean from your quoted code; I've not used it 
myself.

>I've tried to scour the TeXbook for these answers, but I've come up
>short handed.
>
>=== [ BEGIN PASTE ] ===
>\catcode`\^^c4=13
>\catcode`\^^c5=13
>\def^^c4#1#2{\expandafter\def\csname c4:#1\endcsname{#2}}
>\def^^c5#1#2{\expandafter\def\csname c5:#1\endcsname{#2}}
>ā{\=a}
>ē{\=e}
>ī{\=\i}
>ō{\=o}
>ū{\=u}
>Ā{\=A}
>Ē{\=E}
>Ī{\=I}
>Ō{\=O}
>Ū{\=U}
>\def^^c4#1{\csname c4:#1\endcsname}
>\def^^c5#1{\csname c5:#1\endcsname}
>
>āēīōūĀĒĪŌŪ
>
>\bye
>=== [ END PASTE ] ===
>
>Many thanks

Doug McKenna