search accents in pdf generated by TeX

Ulrike Fischer news3 at nililand.de
Sat Jan 29 13:17:24 CET 2022


On Fri, 28 Jan 2022 22:20:03 -0800, William F Hammond via texhax
wrote:


>>> First, I don't know what the statement "accented letters are
>>> not recognized by the pdf" means.  If we're talking about
>>> typesetting with pdftex, then I think that the PDF output is
>>> UTF-8 encoded. 
>>
>> No.
> 
> I do understand that a PDF file is not a text file if that
> is why you are saying "no".  But ...
> 
>>> If one runs the program "pdftotext", which
>>> is part of an Ubuntu package called poppler-utils on my
>>> Ubuntu platform, the output text is UTF-8 encoded.  I think
>>> that text TeX's algorithmic accents are implemented using
>                    ^^^^^^^^^^^
>>> Unicode combining characters. 

That a converter outputs UTF-8 doesn't mean that UTF-8 is in the PDF.
You can also convert a PDF to HTML, but that doesn't mean there is
HTML in the PDF.

In a PDF, only a few strings use anything related to UTF-8 encoding,
and none of them are related to the text output.


>>
>> No, not with pdftex. If you compile
>>
>> \documentclass{article}
>>
>> \begin{document}
>> ä ö ü é è
>   ^^^^^^^^^
> 
> When I said that I was using TeX's *algorithmic* accents,
> I meant  \"a \"o \"u \'e \`e
> 

That doesn't matter. LaTeX processes the input ä as \"a, so you would
get the same result if you typed the accent commands.
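
A small sketch to check this yourself (the file is only an
illustration; I am assuming pdflatex with the default OT1 font
encoding and a UTF-8 encoded source). Both lines should end up as the
same accent-plus-letter constructions in the PDF:

```tex
\documentclass{article}
\usepackage[utf8]{inputenc} % only needed with an older LaTeX
% No fontenc package is loaded, so the default OT1 encoding is used.
% Its Computer Modern fonts have no precomposed accented letters.
\begin{document}
ä ö ü é è                 % typed directly as UTF-8

\"a \"o \"u \'e \`e       % typed with accent commands
\end{document}
```

Copy and paste from both lines of the resulting PDF and compare what
you get.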

> But, anyway, the original question was about
> plain TeX, not LaTeX, and I was trying to address
> that.

The core problems are quite similar in plain TeX.
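
As a minimal sketch for plain TeX (again only an illustration, to be
compiled with pdftex): the default font cmr10 has no precomposed
accented letters, so I would expect the same kind of copy-and-paste
result as with LaTeX and OT1.

```tex
% Plain TeX; compile with pdftex, not pdflatex.
% The accent macros place an accent glyph over the base letter,
% so the PDF contains two glyphs for each accented letter.
\"a \"o \"u \'e \`e
\bye
```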

> 
>> \end{document}
>>
>> and then copy and paste you will get
>>
>>     ¨a ¨o ¨u ´e `e
>>
>> that is 
>>    
>>     U+00A8a U+00A8o U+00A8u U+00B4e U+0060e
>>
>> (U+00A8, for example, is the diaeresis).
> 
> I have not been able to duplicate what you say.
> 
> Perhaps you have a newer version of pdftex.  Mine
> is from TeXLive 2017.  But I doubt if that is the
> explanation.  Perhaps there are differences in
> "locale" arrangements.

TeX Live 2017 should behave no differently here.


> 
>> if you add \usepackage[T1]{fontenc} and so use a font
>> which has the needed glyphs then you get the correct
>> unicode code points and a searchable pdf
>>
>>      ä ö ü é è
> 
> With what input encoding for your LaTeX source? 

UTF-8. With an older LaTeX you may have to declare it with

\usepackage[utf8]{inputenc}

In LaTeX releases since 2018 that is the default.
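
Put together, a minimal sketch of the complete test file would look
like this (assuming the source is saved as UTF-8; whether copy and
paste works in the end can also depend on the font and the viewer):

```tex
\documentclass{article}
\usepackage[utf8]{inputenc} % only needed with an older LaTeX
\usepackage[T1]{fontenc}    % fonts with precomposed accented glyphs
\begin{document}
ä ö ü é è
\end{document}
```

With T1 each accented letter is a single glyph in the font, so there
is no separate accent for the viewer to extract.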


> Are you saying that you have an arrangement for pdflatex to read
> unicode up U+00FF when text-encoded as UTF-8 (as in your
> last email) ??

Yes. 


-- 
Ulrike Fischer 
http://www.troubleshooting-tex.de/


