search accents in pdf generated by TeX

Ulrike Fischer news3 at nililand.de
Fri Jan 28 09:11:17 CET 2022


On Thu, 27 Jan 2022 21:55:32 -0800, William F Hammond via
texhax wrote:

> First, I don't know what the statement "accented letters are
> not recognized by the pdf" means.  If we're talking about
> typesetting with pdftex, then I think that the PDF output is
> UTF-8 encoded. 

No.

> If one runs the program "pdftotext", which
> is part of an Ubuntu package called poppler-utils on my
> Ubuntu platform, the output text is UTF-8 encoded.  I think
> that text TeX's algorithmic accents are implemented using
> Unicode combining characters. 

No, not with pdftex. If you compile

\documentclass{article}

\begin{document}
ä ö ü é è
\end{document}

and then copy and paste you will get

    ¨a ¨o ¨u ´e `e

that is 
   
    U+00A8a U+00A8o U+00A8u U+00B4e U+0060e

(U+00A8, for example, is the spacing diaeresis).

So no combining accents are involved. PDF viewers typically can't
search for these accented characters, and you can't copy them
correctly into other applications.

If you add \usepackage[T1]{fontenc}, and so use a font which has the
needed glyphs, then you get the correct Unicode code points and a
searchable PDF:

     ä ö ü é è
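
As a minimal sketch, the complete test file would then look like this
(assuming pdflatex and a LaTeX release from 2018 or later, where UTF-8
is the default input encoding, so no inputenc line is needed):

\documentclass{article}
\usepackage[T1]{fontenc} % T1-encoded fonts contain ä, ö, ü, é, è as single glyphs

\begin{document}
ä ö ü é è
\end{document}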


-- 
Ulrike Fischer 
http://www.troubleshooting-tex.de/


