[tex-live] TL expl3 update broke a mwe for me

Sun Jan 3 09:02:18 CET 2016

There it is: When luatex reads CaseFolding it fails to read over a UTF-8 Sequence in a comment!

There are two errors. First, the luatex parser should not try to read "control sequences" in comments, should it? Second, a file that contains ASCII representations of Unicode characters should only have comments in the 7-bit ASCII range.

Replace line 20 of CaseFolding.txt with 

	# full case foldings are superior: for example, they allow "MASSE" and "Ma(sharp s)e" to match.

for example and the luatex error is gone.
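
To find such lines without hunting for them by eye, here is a minimal sketch in Python. It assumes that comments start at a "#" character, and the function name non_ascii_comment_lines is made up for this illustration; the sample bytes are only a stand-in for the real file.

```python
def non_ascii_comment_lines(data: bytes) -> list[tuple[int, str]]:
    """Return (line number, offending bytes as hex) for comments holding bytes >= 0x80."""
    hits = []
    for number, line in enumerate(data.splitlines(), start=1):
        # Assumption: everything after the first '#' on a line is a comment.
        _, hash_sign, comment = line.partition(b"#")
        if not hash_sign:
            continue  # no comment on this line
        bad = bytes(b for b in comment if b >= 0x80)
        if bad:
            hits.append((number, bad.hex(" ")))
    return hits


# Stand-in for the file contents; the second line carries a UTF-8 sharp s.
sample = (
    b"# CaseFolding sample\n"
    b'# they allow "MASSE" and "Ma\xc3\x9fe" to match.\n'
    b"0041; C; 0061; # LATIN CAPITAL LETTER A\n"
)
print(non_ascii_comment_lines(sample))
```

For the real file, read it in binary mode and pass the bytes to the function.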

regards,

ingo

> Ah, no, I get the same error. But what is latin9 luainputenc anyway?
> 
> It seems to be broken. Do you really need latin-X?
> 
> Convert all your source files to utf8 (with iconv for example) and leave the bad old codepages for people who still run windows 95 or whatever.
> 
> regards
> 
> ingo
> 
> 
>>> ! Undefined control sequence.
>>> <argument> ...or: for example, they allow "MASSE" and "MaÃŸ
>>>                                                   e" to match. 
>>> l.4195   \__unicode_map_inline:n { CaseFolding.txt }
>> 
>> This looks like an encoding error.  It would help if you copy and paste the strange output into od or xxd for example.
>> 
>> Your non ascii sequence seems to be C3 83 C2 9F, which appears as a double UTF-8 encoding or something similar. Either the encoding of your mail, the encoding of your system or the encoding of the CaseFolding.txt file is bad, I would bet.
>> 
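
A short Python sketch of how such a double encoding can arise. The assumption here is that some component misread the UTF-8 bytes as Latin-1 and then re-encoded them as UTF-8; the thread does not establish which component did it, or that Latin-1 was the codepage involved.

```python
original = "Maße"

# Correct UTF-8 encoding: the sharp s becomes the two bytes C3 9F.
once = original.encode("utf-8")

# Assumed fault: the bytes are misread as Latin-1, one character per byte ...
misread = once.decode("latin-1")

# ... and then encoded as UTF-8 a second time.
twice = misread.encode("utf-8")

print(once.hex(" "))
print(twice.hex(" "))
```

The second line of output contains c3 83 c2 9f in place of c3 9f.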
>> With your numbers above, written in binary form you have:
>> 
>> 	11000011 10000011
>> 
>> and
>> 
>> 	11000010 10011111
>> 
>> that are quickly calculated into ascii / unicode numbers through the guessed utf-8 encoding
>> 
>>           01.   x in [000000.00000000.0bbbbbbb] → 0bbbbbbb
>>           10.   x in [000000.00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
>>           11.   x in [000000.bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
>>           100.  x in [0bbbbb.bbbbbbbb.bbbbbbbb] → 11110bbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
>> 
>> where we just need the 2nd (10) rule, here.
>> 
>> 	decode_utf8(11000011 10000011) = 000 1100 0011
>> 	decode_utf8(11000010 10011111) = 000 1001 1111
>> 
>> The two results, C3 and 9F, again form a UTF-8 sequence (guessed again).
>> 
>> 	decode_utf8(11000011 10011111) = 1101 1111 = DF
>> 
>> 	unicode DF = ß	(latin small letter sharp s)
>> 
>> So "Masse" and "Maße" match.
>> 
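
The decoding steps above can be checked with a few lines of Python. This is only a sketch: decode_two_byte is a helper name made up here, it handles just the 2-byte rule, and the Latin-1 step in the built-in variant is an assumption about how the bytes were misread.

```python
def decode_two_byte(lead: int, cont: int) -> int:
    """Decode one 2-byte UTF-8 sequence (110bbbbb 10bbbbbb) to a code point."""
    if lead & 0b11100000 != 0b11000000 or cont & 0b11000000 != 0b10000000:
        raise ValueError("not a 2-byte UTF-8 sequence")
    # Keep the 5 payload bits of the lead byte and the 6 of the continuation byte.
    return ((lead & 0b00011111) << 6) | (cont & 0b00111111)


first = decode_two_byte(0xC3, 0x83)      # first pair of the sequence
second = decode_two_byte(0xC2, 0x9F)     # second pair of the sequence
final = decode_two_byte(first, second)   # the two results, decoded once more

# Same result with the built-in codecs, assuming a Latin-1 misreading in between.
builtin = bytes.fromhex("C3 83 C2 9F").decode("utf-8").encode("latin-1").decode("utf-8")

print(hex(first), hex(second), hex(final), builtin)
```
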
>> First shot: What is your system encoding? Most systems now use UTF-8 encodings. Check your locale by just typing locale. This is the output on my system:
>> 
>> 	# locale
>> 	LANG=en_US.UTF-8
>> 	LC_CTYPE=de_DE.UTF-8
>> 	LC_NUMERIC=de_DE.UTF-8
>> 	LC_TIME=de_DE.UTF-8
>> 	LC_COLLATE=de_DE.UTF-8
>> 	LC_MONETARY=de_DE.UTF-8
>> 	LC_MESSAGES="en_US.UTF-8"
>> 	LC_PAPER="en_US.UTF-8"
>> 	LC_NAME="en_US.UTF-8"
>> 	LC_ADDRESS="en_US.UTF-8"
>> 	LC_TELEPHONE="en_US.UTF-8"
>> 	LC_MEASUREMENT="en_US.UTF-8"
>> 	LC_IDENTIFICATION="en_US.UTF-8"
>> 	LC_ALL=
>> 
>> Try your example with a utf8 system encoding.
>> 
>> regards
>> 
>> ingo