[LaTeX] beamer and inputenc (utf8x) issue

Don Hosek don.hosek at gmail.com
Fri Jul 1 22:59:40 CEST 2022


I suppose one could convert the file to NFD normalization, in which á is represented by the sequence a + combining acute, so you would grab the a rather than á. I've actually spent a bunch of time implementing Unicode segmentation for finl (the existing Rust code for segmentation didn't have the right interface for my needs). Aside from UTF-8 multi-byte sequences (which is anything not in 7-bit ASCII), there are all the assorted combining characters, including combining Jamo for Hangul, as well as multi-character glyphs like country and region flags and emoji built by joining base characters with the Unicode ZWJ (zero-width joiner).
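
To illustrate the NFD point, here's a minimal, untested sketch for xelatex or lualatex (\initial is just a made-up example macro; ^^^^0301 is TeX notation for U+0301, the combining acute):

\documentclass{article}
\usepackage{fontspec}
% A stand-in for any macro that grabs the first character as an argument.
\newcommand{\initial}[1]{\textbf{#1}}
\begin{document}
% NFC input: the precomposed á is a single token, so \initial gets all of it.
\initial ábc

% NFD input: a followed by the combining acute; \initial grabs only the
% plain a, and the accent is left stranded outside the argument.
\initial a^^^^0301bc
\end{document}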

-dh

> On 1 Jul 2022, at 14:56, Dr Nicola L C Talbot via tex-live <tex-live at tug.org> wrote:
> 
>>>> If your input is in UTF-8, it is better to use an engine working
>>>> internally in Unicode, i.e. luatex or xetex.
>>> 
>>> Millions of documents use utf8 with pdflatex without problems. I run
>>> 95% of my documents with pdflatex. They are all utf8-encoded and, as
>>> I'm German, my texts do contain umlauts and other non-ASCII chars.
>>> The only thing that pdflatex can't handle is combining accents.
>>> 
>>> 
>> I am not that lucky; most of my old pdflatex documents fail. I often
>> have \everypar containing a macro with one parameter for setting an
>> initial. If I output it as {\otherfont #1} and the token is the first
>> octet of a multioctet character, it fails. The character V as an initial
>> needs an extra kern, so if the macro contains \if#1V and #1 is the
>> first octet of a multioctet character, it fails. I often use
>> \futurelet\testchar\dosomething, and if \testchar becomes the first
>> octet of a multioctet character, \dosomething fails. And all of this happens
>> even without hyperref. I stopped using pdflatex a few years ago. Now I
>> have 15 versions of TeX Live installed, and when I have to recompile an
>> old document, I go back in history and try to find the version of TL in
>> which the document works. It is quite common for me that the old pdflatex
>> documents do not work in the current TL.
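>> 
>> A minimal sketch of that failure mode (\initial here is just an
>> illustrative stand-in for the kind of \everypar macro described above):
>> 
>> \documentclass{article}
>> \usepackage[utf8]{inputenc}
>> \usepackage[T1]{fontenc}
>> % V as an initial needs an extra kern, hence the \if test.
>> \newcommand{\initial}[1]{\if#1V\kern-1pt\fi{\itshape #1}}
>> \begin{document}
>> \initial Very well. % fine: #1 is the single token V
>> \initial Éminence grise. % fails: #1 is only the first octet of É
>> \end{document}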
>> 
>> Your documents were presumably not specifying an encoding. Since the
>> default encoding was switched to UTF-8, we have had essentially no
>> reports of documents breaking. Any document that was correctly declaring
>> its encoding continues to work the same way, and any old document using
>> non-ASCII characters without declaring an encoding (which was possible
>> but never supported, and produced good or bad results depending on the
>> font encoding in use) can be used with a current LaTeX by adding
>> 
>> \UseRawInputEncoding
> 
> I think the point regarding pdflatex vs a Unicode engine is pertinent for the particular cases where a multioctet character needs to be grabbed in its entirety without explicitly grouping it. This has always been an issue with utf8 and inputenc.
> 
> For example, to sentence case some text:
> 
> \MakeUppercase ábc
> 
> This works without a problem with xelatex and lualatex but fails with pdflatex. This is a problem for glossaries.sty, where I have to tell users they need to group the first letter if they want to use any sentence-casing commands. This isn't intuitively obvious, since visually the character looks like a single entity.
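> 
> For instance, grouping the first character keeps all of its octets
> together, so this sketch of the workaround also works with pdflatex:
> 
> \MakeUppercase{á}bc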
> 
> Example document:
> 
> \documentclass{article}
> 
> \usepackage{glossaries}
> 
> \newglossaryentry{elite}{name={élite},description={...}}
> 
> \begin{document}
> \Gls{elite} forces entered the building.
> \end{document}
> 
> Again, it works fine with xelatex and lualatex but not with pdflatex, which requires:
> 
> \newglossaryentry{elite}{name={{é}lite},description={}}
> 
> (Incidentally, UTF-8 now works in glossary labels. It's the sentence-casing that's the issue here.)
> 
> Some time ago I experimented with trying to grab multiple octets (in datatool-base.sty, which is used by glossaries.sty). It used to work for two-octet characters, but it has stopped working now. I haven't had time to investigate, but it would be really useful if inputenc provided a way to grab all the octets of a multioctet character, so that sentence casing (and similar tasks) can work without having to group the initial character.
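> 
> As an illustration of the kind of interface I mean, here's a hypothetical
> sketch (\ApplyToFirstChar and \grab@octets are made-up names, not anything
> inputenc actually provides): inspect the lead octet's value to work out
> how many continuation octets follow, collect them, and only then hand the
> complete character to the wrapped command:
> 
> \documentclass{article}
> \usepackage[utf8]{inputenc}
> \usepackage[T1]{fontenc}
> \makeatletter
> % Hypothetical helpers, not part of inputenc. Decide from the lead octet
> % how many continuation octets follow (1 for 2-octet, 2 for 3-octet,
> % 3 for 4-octet UTF-8 sequences).
> \newcommand\ApplyToFirstChar[2]{%
>   \ifnum`#2>"EF \def\@nextgrab{\grab@octets{#1}{#2}{3}}%
>   \else\ifnum`#2>"DF \def\@nextgrab{\grab@octets{#1}{#2}{2}}%
>   \else\ifnum`#2>"BF \def\@nextgrab{\grab@octets{#1}{#2}{1}}%
>   \else\def\@nextgrab{#1{#2}}% plain ASCII: nothing more to grab
>   \fi\fi\fi\@nextgrab}
> % Collect #3 more octets into #2, then apply #1 to the whole character.
> \def\grab@octets#1#2#3{%
>   \ifnum#3>0
>     \def\@nextgrab##1{\grab@octets{#1}{#2##1}{\numexpr#3-1\relax}}%
>   \else\def\@nextgrab{#1{#2}}%
>   \fi\@nextgrab}
> \makeatother
> \begin{document}
> \ApplyToFirstChar\MakeUppercase ábc % Ábc, even with pdflatex
> \end{document}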
> 
> Regards
> Nicola Talbot



