[LaTeX] Re: beamer and inputenc (utf8x) issue

Dr Nicola L C Talbot n.talbot at uea.ac.uk
Fri Jul 1 21:56:38 CEST 2022


>>> If your input is in UTF-8, it is better to use an engine working
>>> internally in Unicode, i.e. luatex or xetex.
>>
>> Millions of documents use utf8 with pdflatex without problems. I run
>> 95% of my documents with pdflatex. They are all utf8-encoded and, as
>> I'm German, my texts do contain umlauts and other non-ASCII chars.
>> The only thing that pdflatex can't handle is combining accents.
>>
>>
> I am not that lucky; most of my old pdflatex documents fail. I often
> have \everypar containing a macro with one parameter for setting an
> initial. If I output it as {\otherfont #1} and the token is the first
> octet of a multioctet character, it fails. The character V as an initial
> needs an extra kern, so if the macro contains \if#1V and #1 is the
> first octet of a multioctet character, it fails. I often use
> \futurelet\testchar\dosomething, and if \testchar becomes the first
> octet of a multioctet character, \dosomething fails. And it happens
> even without hyperref. I stopped using pdflatex a few years ago. Now I
> have 15 versions of TeX Live installed, and when I have to recompile an
> old document, I go back through the history to find the version of TL
> in which the document works. It is quite common for me that the old
> pdflatex documents do not work in the current TL.
> 
> Your documents were presumably not specifying an encoding.
> 
> Since the default encoding was switched to UTF-8, we have had essentially no reports of documents breaking.
> Any document that was correctly declaring its encoding continues to work the same way, and any old document
> using non-ASCII characters without declaring an encoding (which was possible but never supported, and produced
> good or bad results depending on the font encoding in use) can be used with a current LaTeX by adding
> \UseRawInputEncoding

I think the point regarding pdflatex vs a Unicode engine is pertinent 
for the particular cases where a multioctet character needs to be 
grabbed in its entirety without explicitly grouping it. This has always 
been an issue with utf8 and inputenc, and it still is.

For example, to sentence-case some text:

\MakeUppercase ábc

This works without a problem with xelatex and lualatex but fails with 
pdflatex, because \MakeUppercase only grabs the first octet of á. This 
is a problem for glossaries.sty, where I have to tell users to group 
the first letter if they want to use any sentence-casing commands. That 
isn't intuitively obvious, since visually the character looks like a 
single entity.
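
By contrast, explicitly grouping the first character works with all 
three engines, since the group keeps both octets of á together for 
\MakeUppercase to pick up:

\MakeUppercase{á}bc

This is the group that glossaries.sty users currently have to add by 
hand.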

Example document:

\documentclass{article}

\usepackage{glossaries}

\newglossaryentry{elite}{name={élite},description={...}}

\begin{document}
\Gls{elite} forces entered the building.
\end{document}

Again, it works fine with xelatex and lualatex but not with pdflatex, 
which requires:

\newglossaryentry{elite}{name={{é}lite},description={}}

(Incidentally, UTF-8 now works in glossary labels. It's the 
sentence-casing that's the issue here.)
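
For reference, the sentence-casing underneath \Gls is done by 
\makefirstuc from mfirstuc.sty (which glossaries.sty loads), so the 
same workaround at that level looks something like this, assuming the 
pdflatex behaviour described above:

\makefirstuc{élite}   % fails with pdflatex: only é's first octet is grabbed
\makefirstuc{{é}lite} % works: produces Élite

Both forms are fine with xelatex and lualatex.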

Some time ago I experimented with trying to grab multiple octets (in 
datatool-base.sty, which is used by glossaries.sty). That used to work 
for two-octet characters, but it has since stopped working. I haven't 
had time to investigate, but it would be really useful if inputenc 
provided a way to grab all the octets of a multioctet character, so 
that sentence casing (and similar tasks) could work without having to 
group the initial character.
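
To illustrate the kind of facility I mean, here is a rough sketch of 
one possible approach, assuming pdflatex with utf8 inputenc. The 
\grabchar command and its \grab@... helpers are hypothetical names, not 
anything in inputenc or datatool-base.sty, and there is no error 
handling for invalid byte sequences. The idea is simply that the lead 
octet's character code tells you how many continuation octets to grab 
("C2-"DF starts a two-octet character, "E0-"EF a three-octet one, "F0 
upwards a four-octet one):

\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\makeatletter
% Hypothetical \grabchar: grab all the octets of the next character
% and pass them, as a single group, to the one-argument command #1.
\newcommand*{\grabchar}[1]{\def\grab@cs{#1}\grab@i}
% The token after a backtick isn't expanded, so `#1 safely gives the
% character code of the octet just grabbed, even though it's active.
\def\grab@i#1{%
  \ifnum`#1<"C2 % single octet (ASCII)
    \def\grab@next{\grab@cs{#1}}%
  \else\ifnum`#1<"E0 % lead octet of a two-octet character
    \def\grab@next{\grab@ii{#1}}%
  \else\ifnum`#1<"F0 % lead octet of a three-octet character
    \def\grab@next{\grab@iii{#1}}%
  \else % lead octet of a four-octet character
    \def\grab@next{\grab@iv{#1}}%
  \fi\fi\fi\grab@next}
\def\grab@ii#1#2{\grab@cs{#1#2}}
\def\grab@iii#1#2#3{\grab@cs{#1#2#3}}
\def\grab@iv#1#2#3#4{\grab@cs{#1#2#3#4}}
\makeatother
\begin{document}
\grabchar\MakeUppercase ábc % should print Ábc with pdflatex
\end{document}

If inputenc provided something along these lines, a command like 
\grabchar could replace the explicit group, and packages like 
glossaries.sty could use it for sentence casing.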

Regards
Nicola Talbot

