problem with foreign letters in names apparently from crossref,

Mike Marchywka marchywka at hotmail.com
Tue Feb 8 16:12:53 CET 2022


On Tue, Feb 08, 2022 at 08:06:01AM -0600, Don Hosek wrote:
>    On 8 Feb 2022, at 02:59, texhax-request at tug.org wrote:
> 
>    My c++ code uses a lot of typedefed strings that I guess could easily be
>    set to use wide characters, and I have a char class parser that is probably
>    perfectly general up to at least int-sized chars. However, I still use 8 bit char
>    for characters in places, routinely test based on ASCII, etc.
> 
>    UTF-8 uses 8-bit values exclusively, but non-ASCII characters are represented as multi-byte sequences. So, for
>    example, ç (c-cedilla) is codepoint U+00E7, but in UTF-8 this will be two bytes: 0xC3 0xA7 (in some circumstances you may
>    also encounter the semantically equivalent c + combining cedilla, which will be c followed by U+0327, represented in UTF-8
>    as 0xCC 0xA7).
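
Just to make the two-byte point concrete, here is a minimal standalone sketch ( not TooBib code ) that dumps the raw bytes of a UTF-8 literal for the name from this thread:

    #include <cstdio>

    int main() {
        // "Gonçalves"; the ç is U+00E7, written here as its UTF-8 bytes 0xC3 0xA7.
        // The split literal keeps the hex escape from swallowing the 'a' that follows.
        const char *name = "Gon\xc3\xa7" "alves";
        for (const unsigned char *p = (const unsigned char *)name; *p; ++p)
            std::printf("%02x ", *p);   // 47 6f 6e c3 a7 61 6c 76 65 73
        std::printf("\n");
        return 0;
    }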

lol, I'm right there now :) This is a big distraction for me; right now I just
wanted to accommodate the xml, and that looks like it is coming together
as I have all the pieces.
If you are interested, however, the chars are getting jumbled. For example,
"Gonçalves" ( "Gonzales" in my english lol ) starts out ok afaict using printf
( this just runs an html parser on the input xml and isolates the name ):

 toobib -hhtml ref/jaxk.xml 2>/dev/null | grep Gon | sed -e 's/.*Go//'  | od -ax
0000000   n   C   '   a   l   v   e   s  nl
           c36e    61a7    766c    7365    000a
0000011
marchywka at happy:/home/documents/cpp/proj/toobib$ toobib -hhtml ref/jaxk.xml 2>/dev/null | grep Gon | sed -e 's/.*Go//'  
nçalves
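
Decoding those two bytes by hand ( a minimal sketch, handling only the two-byte case ) confirms the parser output is the well-formed encoding of U+00E7:

    #include <cstdio>

    int main() {
        // lead byte 110xxxxx, continuation byte 10xxxxxx, as in the od dump above
        unsigned char lead = 0xC3, cont = 0xA7;
        unsigned cp = ((lead & 0x1Fu) << 6) | (cont & 0x3Fu);
        std::printf("U+%04X\n", cp);    // prints U+00E7, i.e. ç
        return 0;
    }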

But the output of "TooBib", the bibtex generator ( or the standalone c++ class to be integrated into
TooBib ), comes out differently:


 echo load ref/jaxk.xml | ./a.out  2>&1  | grep "|Gon" | sed -e 's/.*Go//' | tail -n 1 
nçalves
marchywka at happy:/home/documents/cpp/proj/toobib$ echo load ref/jaxk.xml | ./a.out  2>&1  | grep "|Gon" | sed -e 's/.*Go//' | tail -n 1  | od -ax
0000000   n   C etx   B   '   a   l   v   e   s  nl
           c36e    c283    61a7    766c    7365    000a
0000013
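
fwiw the extra 0x83 and 0xC2 in that dump look like the ç bytes got encoded twice somewhere: 0xC3 re-encoded as 0xC3 0x83 and 0xA7 as 0xC2 0xA7. A minimal sketch ( standalone, not the actual TooBib code ) that reproduces the pattern by re-encoding already-UTF-8 bytes as if they were Latin-1 code points:

    #include <cstdio>
    #include <string>

    // Hypothetical helper: treat each input byte as a Latin-1 code point and
    // encode it to UTF-8 -- the accidental second pass that jumbles the name.
    static std::string latin1_to_utf8(const std::string &in) {
        std::string out;
        for (unsigned char c : in) {
            if (c < 0x80) {
                out += (char)c;
            } else {
                out += (char)(0xC0 | (c >> 6));
                out += (char)(0x80 | (c & 0x3F));
            }
        }
        return out;
    }

    int main() {
        std::string once = "Gon\xc3\xa7" "alves";   // correct UTF-8
        std::string twice = latin1_to_utf8(once);   // accidental second pass
        for (unsigned char c : twice) std::printf("%02x ", c);
        std::printf("\n");   // 47 6f 6e c3 83 c2 a7 61 6c 76 65 73 -- matches the dump
        return 0;
    }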

As long as bibtex accepts it, I'm just going to leave it there for now.

Thanks...

> 
>    Pretty much any new code that deals with generalized inputs should assume that its input is UTF-8-encoded. UTF-8 has the
>    advantage that for any multi-byte sequence, the starting byte can always be identified as such (so the classic interview
>    problem of "reverse this string" can almost* be handled by determining whether a byte is a starting byte for a sequence or a
>    continuation byte; the problem's contrived nature begins to show itself in the 21st century).
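
If it helps, here is a minimal sketch of that lead-vs-continuation test, with a code-point-level reverse ( subject to the combining-sequence caveat in your footnote ):

    #include <cstdio>
    #include <string>

    // Continuation bytes always match 10xxxxxx; anything else begins a sequence.
    static bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

    // Reverse a UTF-8 string code point by code point.
    static std::string reverse_utf8(const std::string &s) {
        std::string out;
        size_t i = s.size();
        while (i > 0) {
            size_t end = i;
            while (i > 0 && is_continuation((unsigned char)s[i - 1])) --i;
            if (i > 0) --i;                     // step back over the lead byte
            out.append(s, i, end - i);
        }
        return out;
    }

    int main() {
        // "Gonçalves" reversed keeps the 0xC3 0xA7 pair together: "sevlaçnoG"
        std::printf("%s\n", reverse_utf8("Gon\xc3\xa7" "alves").c_str());
        return 0;
    }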
> 
>    Depending on the level of support you need,
> 
>    http://utfcpp.sourceforge.net
> 
>    should suffice for your needs. I’m guessing though that the biggest problem you’re running into is that you’re likely
>    running your code under Windows and haven’t done whatever is necessary to communicate to the OS that the program is
>    outputting UTF-8 which might be all that’s necessary.
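
In case the Windows part is the issue for anyone reading along, here is a minimal sketch of one way to tell the console the output is UTF-8; SetConsoleOutputCP is the Win32 call, and on other platforms the program just prints the bytes as-is:

    #include <cstdio>
    #ifdef _WIN32
    #include <windows.h>
    #endif

    int main() {
    #ifdef _WIN32
        // The console defaults to a legacy code page, so UTF-8 bytes show as
        // mojibake unless the output code page is switched to 65001 (UTF-8).
        SetConsoleOutputCP(CP_UTF8);
    #endif
        std::printf("Gon\xc3\xa7" "alves\n");
        return 0;
    }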
> 
>    -dh
> 
>    * The "almost" comes into play when you remember that Unicode allows for combining character sequences like the
>    above-mentioned c + combining cedilla, not to mention oddities like some emoji that are generated by combining sequences
>    like bear+ZWJ+snowflake = polar bear, flags which are (mostly) two-character sequences of regional indicator letters, etc.

-- 

mike marchywka
306 charles cox
canton GA 30115
USA, Earth 
marchywka at hotmail.com
404-788-1216
ORCID: 0000-0001-9237-455X

