problem with foreign letters in names apparently from crossref,

Don Hosek don.hosek at gmail.com
Tue Feb 8 15:06:01 CET 2022


On 8 Feb 2022, at 02:59, texhax-request at tug.org wrote:
> 
> My c++ code uses a lot of typedefed strings that I guess could be easily
> set to use wide characters and I have a char class parser that is perfectly
> general up to at least int-size chars, probably. However, I still use 8-bit char
> for characters in places, routinely test based on ASCII etc.
> 

UTF-8 uses 8-bit values exclusively, but non-ASCII characters are represented as multi-byte sequences. For example, ç (c-cedilla) is code point U+00E7, which in UTF-8 is two bytes: 0xC3 0xA7. (In some circumstances you may also encounter the semantically equivalent c + combining cedilla, which is c followed by U+0327, represented in UTF-8 as 0xCC 0xA7.)
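To make the byte values above concrete, here is a minimal sketch of a UTF-8 encoder for a single code point. The name encode_utf8 is made up for illustration, and the function skips validation (it does not reject surrogates or values above U+10FFFF), so treat it as a demonstration rather than production code.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Assumption: cp is a valid scalar value; no validation is done here.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // ASCII: one byte, 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, encode_utf8(0xE7) yields the two bytes 0xC3 0xA7 for ç, and encode_utf8(0x327) yields 0xCC 0xA7 for the combining cedilla.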

Pretty much any new code that deals with generalized inputs should assume that its input is UTF-8-encoded. UTF-8 has the advantage that, for any multi-byte sequence, the starting byte can always be identified as such: continuation bytes always have the bit pattern 10xxxxxx, and starting bytes never do. So the classic interview problem of “reverse this string” can almost* be handled by determining whether a byte is a starting byte for a sequence or a continuation byte. (The classic interview problem’s contrived nature begins to show itself in the 21st century.)
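As a rough sketch of that idea, the following reverses a UTF-8 string one code point at a time by walking backwards and treating every byte of the form 10xxxxxx as a continuation byte. The function names are invented for this example, and the code assumes the input is well-formed UTF-8; it does nothing about combining sequences.

```cpp
#include <cstddef>
#include <string>

// A UTF-8 continuation byte always has the bit pattern 10xxxxxx.
bool is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: reverse a string code point by code point.
// Assumption: s is well-formed UTF-8.
std::string reverse_code_points(const std::string& s) {
    std::string out;
    out.reserve(s.size());
    std::size_t end = s.size();
    while (end > 0) {
        // Step back over continuation bytes to find the starting byte.
        std::size_t start = end - 1;
        while (start > 0 && is_continuation(static_cast<unsigned char>(s[start]))) {
            --start;
        }
        // Copy the whole sequence [start, end) in its original byte order.
        out.append(s, start, end - start);
        end = start;
    }
    return out;
}
```

Reversing "garçon" this way keeps the two bytes of ç together, whereas a naive byte-wise reverse would swap them and produce invalid UTF-8.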

Depending on the level of support you need, http://utfcpp.sourceforge.net should suffice. I’m guessing, though, that the biggest problem you’re running into is that you’re likely running your code under Windows and haven’t done whatever is necessary to tell the OS that the program is outputting UTF-8. Doing that might be all that’s necessary.
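One possible way to tell Windows about UTF-8 output, offered as a guess at what might help rather than a confirmed fix for this case, is to set the console output code page with the Win32 call SetConsoleOutputCP(CP_UTF8). The wrapper name below is invented for this sketch, and it only affects console output, not files or pipes.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: ask the Windows console to interpret program output
// as UTF-8. On other platforms it does nothing and reports success.
bool enable_utf8_console_output() {
#ifdef _WIN32
    // SetConsoleOutputCP returns nonzero on success.
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return true;
#endif
}
```

Calling enable_utf8_console_output() once at program start, before writing any output, is the intended usage; whether the characters then display correctly also depends on the console font.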

-dh

* The “almost” comes into play when you remember that Unicode allows for combining character sequences like the above-mentioned c + combining cedilla, not to mention oddities like emoji that are built from joined sequences (bear + ZWJ + snowflake = polar bear), flags, which are (mostly) two-character sequences of regional indicator letters, etc.
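To illustrate the gap between code points and what a reader sees as one character, here is a small sketch that counts code points by counting the bytes that are not continuation bytes. The function name is made up for the example, and it assumes well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in well-formed UTF-8 by counting
// every byte that is not a continuation byte (10xxxxxx).
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For c + combining cedilla (bytes 0x63 0xCC 0xA7) this reports 2 code points, and for bear + ZWJ + snowflake it reports 3, even though each displays as a single character. Handling those as units requires grapheme cluster segmentation, which is beyond byte-level inspection.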