[tex4ht] How to get PDF's page numbers in html output? (Accessibility issue)

CV Radhakrishnan cvr at river-valley.org
Sat Dec 24 02:14:50 CET 2011

On Friday 23 December 2011 05:11 AM, Herbert Sitz wrote:
> Susan Jolly<easjolly at ...>  writes:
>> This poster's question is significant for accessibility.  Braille, large
>> print, speech, and other accessible versions of print editions typically use
>> the page numbers of the (base) print (paged media) edition to allow users of
>> accessible documents to communicate with each other and with users of the
>> print edition.  While I appreciate that the concept of "page number" is
>> somewhat meaningless when using eReaders, the accessibility community has
>> not AFAIK addressed alternative solutions. So at least in the foreseeable
>> future this is a capability that accessible media producers need.
> Susan --
> Good point, which I hadn't thought of.  My own query is driven by a slightly
> different but related need:  an academic setting where students may be using
> ebook, html, and/or pdf versions.  Without having some kind of location-based
> counter common to the text of all versions there's no good way for users of
> different versions to refer to the location of a particular passage.

In an online world, the concept and usefulness of pages disappear.
TeX4ht works on this basic premise. Also, different devices have
different geometries, which makes pages practically useless when HTML or
XML-based markup allows text to reflow, unlike the rigid PDF. So,
instead of treating the PDF with its rigid margins as the definitive
version, we have to return to the wisdom of our forefathers, who created
the Bible and formatted it for different geometries while retaining the
ability to refer to any verse, chapter, line, etc., uniformly across all
versions.

So, the best option is to keep paragraph numbers instead of page numbers 
as the basis for reference.
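
As a rough illustration of paragraph-based references (this is not
something TeX4ht does for you), the sketch below post-processes already
generated HTML and stamps each <p> element with a running number. The
function name number_paragraphs, the data-para attribute, and the
para-num class are all made up for this example, and the regular
expression is only a crude stand-in for a real HTML parser. It also
assumes the PDF is numbered from the same source in the same order;
otherwise the numbers in the two versions will not line up.

```python
import re
from itertools import count

# Matches an opening <p ...> tag but not <pre>, <param>, etc.
# Assumption: paragraphs are written as plain, non-self-closing <p> tags.
P_OPEN = re.compile(r"<p(?=[\s>])([^>]*)>", re.IGNORECASE)


def number_paragraphs(html, attr="data-para"):
    """Return html with a running number attached to every <p> element.

    Each paragraph gets a hypothetical attribute (data-para by default)
    and a visible marker such as "[3] " at its start.
    """
    counter = count(1)

    def add_number(match):
        n = next(counter)
        existing_attrs = match.group(1)  # e.g. ' class="noindent"' or ''
        return (
            f'<p{existing_attrs} {attr}="{n}">'
            f'<span class="para-num">[{n}] </span>'
        )

    return P_OPEN.sub(add_number, html)


if __name__ == "__main__":
    sample = '<p class="noindent">First.</p><pre>code</pre><p>Second.</p>'
    print(number_paragraphs(sample))
```

Readers of either version could then cite "paragraph 2" rather than a
page, provided the print edition shows the same numbers.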

> The counter doesn't need to be the pdf page number, but that's an
> already-existing counter that makes sense.  Whatever counter is used, it must be
> present in all versions.
> In looking further at tex4ht I'm not sure merely having ability to insert a
> counter at page breaks would solve this problem.  I have .tex files that I
> process to PDF using pdflatex, and which I process with tex4ht's htlatex to get
> the html.

TeX4ht's page breaks have no relation to those in the corresponding PDF.
If you create XHTML or XML with MathML, you get a dvi of many hundred
pages for a single-page document! Most of those pages will contain only
a single character of text. So, relying on TeX4ht's page breaks does not
take you anywhere.

> The problem I see is that tex4ht alters the formatting in the process of
> generating the html.  tex4ht first compiles the document to an intermediate dvi,
> then uses that dvi to generate the html.

That is correct.

> I had expected the pagination of the
> dvi file to correspond to the pagination of the pdf generated by pdflatex.

Unfortunately, no. TeX4ht's objective is not to format the document but
to translate from one markup to another, and formatting the text to look
exactly like a printable version is hardly necessary to accomplish that.

> Unfortunately, the pagination does not necessarily match.  I'm not sure what

It won't match at all.

> formatting changes tex4ht makes as part of compiling to dvi (besides disabling

The dvi is a convenient file format for TeX4ht's post-processor to 
extract text and markup injected into the dvi as \special's. And it 
provides a nicer way to replace and/or manipulate any character in any 
manner with the help of Eitan's ingenious hypertext fonts.

> header and footer, which would not necessarily affect pagination).  So merely
> having ability to hook in and put in a page counter for each new dvi page would
> not necessarily give pagination markers that correspond to the PDF.

All page breaks and line breaks differ. Attributes like bold, italic,
large, etc. lose the meaning they have in the PDF and take on a
different meaning and a different markup system palatable to the
browser. When glue, vertical and horizontal skips, and character
widths/heights lose their meaning in the TeX4ht-generated dvi, it is
clear that we will seldom get output that visually corresponds to the
PDF output if printed.

> I wonder whether there are some optional settings in tex4ht that would make the
> dvi pagination match (or even closely match) the pagination in the PDF.

Sorry, I don't think that will happen.

> I see a non-tex4ht-related way to generate the page numbers I want in the html,
> but it's not trivial.  Basically, the steps would be these:

Sure, that is the best option. I too would be interested in seeing a
better solution.


> Does anyone know whether there is a publicly available solution for this?  I
> would probably write it in Python using the BeautifulSoup html api; I wonder
> whether something like this is already available on github or elsewhere.
> Or maybe there actually is some way to get tex4ht to (1) generate dvi with
> pagination that corresponds to PDF pagination, and (2) include a page counter in
> the html that corresponds to the PDF pages.

This is an impossibility in TeX4ht as far as I know.

Best regards
