search accents in pdf generated by TeX

William F Hammond hmwlfsr at yahoo.com
Fri Jan 28 06:55:32 CET 2022


Ulrike Fischer <news3 at nililand.de> writes:

> On Fri, 21 Jan 2022 00:53:02 +0100, Gérald Tenenbaum wrote:
>
>> Hello,
>> 
>> I have come across an unexpected issue with the pdf file of a book written 
>> with eplain. It turns out that accented letters are not recognized by the 
>> pdf. Is there anything I can modify either in the source or in the pdf 
>> to make accents recognized?
>
> Well, in LaTeX it works (if you use T1 encoding). So yes, it is
> possible by using the right font (perhaps you also need a cmap or
> glyphtounicode). 
>
> But I don't know if there exists some premade support in eplain. 

I'm really replying to Gérald Tenenbaum.

First, I don't know what the statement "accented letters are
not recognized by the pdf" means.  If we're talking about
typesetting with pdftex, then I take it to be about the text
that can be searched for in, or extracted from, the PDF.  If
one runs the program "pdftotext", which is part of the
poppler-utils package on my Ubuntu platform, the output text
is UTF-8 encoded.  I think that TeX's algorithmic accents
come out as Unicode combining characters.  Thus \'e in TeX
source becomes an e followed by U+0301 (which in UTF-8
encoding is the two bytes 0xCC 0x81).  I believe the same
statement applies if xetex is used.
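
One way to check this on a given installation is a minimal
test file along the following lines.  The file name
accent-test.tex is only a placeholder, and what pdftotext
reports will depend on the font and on any glyph-to-Unicode
mapping in effect, so no expected output is given here.

% accent-test.tex (hypothetical file name)
% Typeset:  pdftex accent-test.tex
% Extract:  pdftotext accent-test.pdf -
%           (the final "-" sends the extracted text to stdout)
This is e-acute: \'e.
\bye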

On the other hand, xetex assumes by default that the TeX
source is UTF-8 encoded Unicode.  It's still OK to enter
\'e, and if you do, you'll get the combining acute accent.
If you want to get U+00E9 in the PDF, which is the real
(precomposed) character, then enter é directly.  (It's easy
enough to change an existing source, for example, with GNU
Emacs using "query-replace".)  You'll need to specify a
suitable Unicode OpenType (or TrueType) font such as "Latin
Modern Roman".  Note: I think I've seen trouble with xetex
when one tries to mix things like \"o with real Unicode
characters, say, ü.
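
A minimal sketch of this direct-input route in plain xetex
might look as follows.  It assumes that the source file is
saved as UTF-8 and that the font can be found by name on
your system; the file name given in the comment is an
assumption based on TeX Live.  I have not tried it with
eplain.

% Run with xetex (not pdftex); save this file as UTF-8.
% The font is looked up by name, which assumes it is
% installed where xetex can see it.  The file-name form
%   \font\lmr="[lmroman10-regular.otf]"
% is an alternative (file name assumed from TeX Live).
\font\lmr="Latin Modern Roman" at 10pt
\lmr

This is e-acute, typed directly: é.
\bye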

Another method, more devious, that allows you to keep \'e is
to wrap the plain code, together with the loading of a
suitable Unicode font, in LaTeX.  Thus, you can run the
following through xelatex (not xetex).  Since I seldom use
plain, I've not tested this approach very much at all.

\documentclass{minimal}
\begin{document}
% plain wrapped in LaTeX
% Load a Unicode OpenType font by name (xetex syntax) and switch to it.
\font\lmr="Latin Modern Roman"
\lmr

This is e-acute: \'e.
\end{document}

For a more elaborate example, see my blog
https://mathbygellmu.blogspot.com/2022/01/xetexAccentHandling.html


                         -- Bill


Email: hmwlfsr at yahoo.com
       gellmu at gmail.com
https://www.facebook.com/william.f.hammond
http://www.albany.edu/~hammond/



