problem with foreign letters in names apparently from crossref,

Mike Marchywka marchywka at hotmail.com
Thu Feb 10 01:49:38 CET 2022


On Tue, Feb 08, 2022 at 08:06:01AM -0600, Don Hosek wrote:
>    On 8 Feb 2022, at 02:59, texhax-request at tug.org wrote:
> 
>    My c++ code uses a lot of typedefed strings that I guess could be easily
>    set to use wide characters and I have a char class parser that is perfectly
>    general up to at least int-sized chars, probably. However, I still use 8-bit char
>    for characters in places, routinely test based on ASCII etc.
> 
>    UTF-8 uses 8-bit values exclusively, but non-ASCII characters are represented as multi-byte sequences. So, for
>    example, ç (c-cedilla) is codepoint U+00E7, but in UTF-8 this will be two bytes: 0xC3 0xA7 (in some circumstances you may
>    also encounter the semantically equivalent c + combining cedilla, which will be c then U+0327, which is represented in UTF-8
>    as 0xCC 0xA7).
> 
>    Pretty much any new code that deals with generalized inputs should assume that its input is UTF-8-encoded. UTF-8 has the
>    advantage that for any multi-byte sequence, the starting byte can always be identified as such (so the classic interview
>    problem of reverse this string can almost* be handled by determining whether a byte is a starting byte for a sequence or a
>    continuation byte—the classic interview problem’s contrived nature begins to show itself in the 21st century).
> 
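The starting-byte property described in the quoted text can be sketched in a few lines of C++. This is only an illustration, assuming valid UTF-8 input; the names is_continuation and reverse_utf8 are made up for the example, and it reverses by code point only, so combining sequences are not kept together.

```cpp
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.
inline bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Hypothetical helper: reverse a UTF-8 string code point by code point.
// Assumes valid UTF-8; combining sequences (c + U+0327, ZWJ emoji) are
// NOT kept together, which is the "almost" in the quoted text.
std::string reverse_utf8(const std::string& s)
{
    std::string out;
    out.reserve(s.size());
    std::size_t end = s.size();
    while (end > 0) {
        std::size_t start = end - 1;
        // Walk back over continuation bytes to find the starting byte.
        while (start > 0 && is_continuation(static_cast<unsigned char>(s[start])))
            --start;
        out.append(s, start, end - start);  // copy one whole code point
        end = start;
    }
    return out;
}
```

With this sketch the bytes 0xC3 0xA7 stay in order when the string around them is reversed.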
Thanks. I thought this was going to be another infinite time sink
but it turned out to resolve pretty easily. This may be a bit
off topic but relevant to dealing with almost-ASCII lol.   
It looks like I had two problems. The first was that the HTML parser,
despite being invoked with the same encoding flag, produced one output
from the C file input and a different one from the C++ ifstream. Changing
the encoding parameter to indicate UTF-8 helped a lot. The second problem,
though, was this char-class parser I have that tries to break up
a string into groups of chars of the same class - letters, digits, whatever.
This works well for a lot of ad hoc stuff; although the logic afterward
may be a bit contorted, it's often easier than trying to make
a custom parser. I have a bunch of included and user-defined
bits for things like printable, alpha, upper case, etc.
The UTF-8 sequences were singled out as breaks in the groups of letters. I
found for now it is easier just to piece them back together, although
I could modify the char-class parser thing (or just check the high bit)
to note an "atomic" group...
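The "just check the high bit" option could look roughly like the sketch below. It is not the actual parser; CharClass, classify and split_by_class are hypothetical names, and treating every byte with the high bit set as a letter is an assumption that also sweeps in non-letter characters such as curly quotes or emoji.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical character classes; a real parser would have more of them.
enum class CharClass { Letter, Digit, Space, Other };

// Classify one byte. Any byte with the high bit set belongs to a UTF-8
// multi-byte sequence and is treated as a letter here (an assumption).
inline CharClass classify(unsigned char b)
{
    if (b & 0x80) return CharClass::Letter;
    if (std::isalpha(b)) return CharClass::Letter;
    if (std::isdigit(b)) return CharClass::Digit;
    if (std::isspace(b)) return CharClass::Space;
    return CharClass::Other;
}

// Split a string into maximal runs of bytes that share a class.
std::vector<std::string> split_by_class(const std::string& s)
{
    std::vector<std::string> groups;
    std::size_t i = 0;
    while (i < s.size()) {
        const CharClass c = classify(static_cast<unsigned char>(s[i]));
        std::size_t j = i + 1;
        while (j < s.size() && classify(static_cast<unsigned char>(s[j])) == c)
            ++j;
        groups.push_back(s.substr(i, j - i));
        i = j;
    }
    return groups;
}
```

Under these assumptions a name containing 0xC3 0xA7 stays in one letter group instead of being broken at the multi-byte sequence.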
 
After all that, it mostly seems to work now and I'm playing with
the email server. I had to add a bunch of things to remove
all the debug output (all of that is a macro, but I just
left it in, adding a global var for gating it out at runtime).
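Gating a debug macro with a global flag might look something like this. It is a guess at the general shape, not the code in question; g_debug_enabled and DEBUG_MSG are hypothetical names.

```cpp
#include <iostream>
#include <sstream>

// Hypothetical global switch; set to false to silence debug output at runtime.
bool g_debug_enabled = true;

// Hypothetical debug macro. The message expression is only evaluated when
// the flag is on; the do/while(0) wrapper makes it behave like one statement.
#define DEBUG_MSG(os, msg) \
    do { if (g_debug_enabled) { (os) << msg << '\n'; } } while (0)
```

A call such as DEBUG_MSG(std::cerr, "n=" << n) then costs only a flag test once g_debug_enabled is set to false.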

Once I started looking at it though I thought about creating
a simple text "ad generator" lol. Creating customized ads on the
fly in the context of a user query and current events
is kind of interesting and there are a lot of real time news
feeds to use :) However I do have important stuff to do ... 



>    Depending on the level of support you need,
> 
>    http://utfcpp.sourceforge.net
> 
>    should suffice for your needs. I’m guessing though that the biggest problem you’re running into is that you’re likely
>    running your code under Windows and haven’t done whatever is necessary to communicate to the OS that the program is
>    outputting UTF-8 which might be all that’s necessary.
> 
>    -dh
> 
>    * The almost comes into play when you remember that Unicode allows for combining character sequences like the
>    above-mentioned c+combining cedilla, not to mention oddities like some emojis which are generated by combining sequences
>    like bear+ZWJ+snowflake = polar bear, flags which are (mostly) two-character sequences of regional indicator letters, etc.

-- 

mike marchywka
306 charles cox
canton GA 30115
USA, Earth 
marchywka at hotmail.com
404-788-1216
ORCID: 0000-0001-9237-455X

