Page numbering revisited

Wed Sep 21 14:11:04 CEST 2022

In:
  https://tug.org/TUGboat/tb42-1/tb130inn.pdf#page=2
there is a discussion about how to extract page numbers.

Given a pdf file, finding the page numbers can often be done easily
with:

$ pdftohtml -stdout -xml tb130inn.pdf 
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml producer="poppler" version="22.06.0">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
        <fontspec id="0" size="15" family="MWORYW+SFRM1000" color="#000000"/>
        <fontspec id="1" size="15" family="SYWLYT+SFBX1000" color="#000000"/>
        <fontspec id="2" size="13" family="YEIBGM+SFRM0900" color="#000000"/>
        <fontspec id="3" size="10" family="YWBOHZ+SFRM0700" color="#000000"/>
        <fontspec id="4" size="15" family="OZXCPT+SFSS1000" color="#000000"/>
        <fontspec id="5" size="15" family="NZNQMF+SFTT1000" color="#000000"/>
        <fontspec id="6" size="15" family="BMPJYV+CMSY10" color="#000000"/>
        <fontspec id="7" size="13" family="UPNUCG+SFTT0900" color="#000000"/>
        <fontspec id="8" size="13" family="DZLDDJ+CMSY9" color="#000000"/>
        <fontspec id="9" size="13" family="SIYDGO+SFBX0900" color="#000000"/>
        <fontspec id="10" size="15" family="QWFIUY+SFTI1000" color="#000000"/>
        <fontspec id="11" size="9" family="XYXHZH+SFRM0600" color="#000000"/>
        <fontspec id="12" size="12" family="WTODAM+SFRM0800" color="#000000"/>
        <fontspec id="13" size="10" family="LDOGYZ+SFTT0800" color="#000000"/>
<text top="71" left="108" width="15" height="13" font="0">18</text>
<text top="71" left="575" width="232" height="13" font="0">TUGboat, Volume 42 (2021), No. 1</text>
<text top="120" left="107" width="140" height="13" font="1">Typographers' Inn</text>
...
<page number="2" position="absolute" top="0" left="0" height="1188" width="918">
...
<text top="71" left="108" width="232" height="13" font="0">TUGboat, Volume 42 (2021), No. 1</text>
<text top="71" left="792" width="15" height="13" font="0">19</text>
...

With this you can easily sort the text into pages and lines.
We can also see that this document has left/right page headers.
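
The sorting into pages and lines could be sketched in Python roughly
like this. It is only a sketch: the function name pages_and_lines and
the trimmed SAMPLE string are my own, and it assumes that text on the
same line shares exactly the same top value, which need not hold for
every pdf.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Trimmed sample in the shape of the pdftohtml -xml output above.
SAMPLE = """<pdf2xml producer="poppler" version="22.06.0">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="71" left="575" width="232" height="13" font="0">TUGboat, Volume 42 (2021), No. 1</text>
<text top="71" left="108" width="15" height="13" font="0">18</text>
<text top="120" left="107" width="140" height="13" font="1">Typographers' Inn</text>
</page>
<page number="2" position="absolute" top="0" left="0" height="1188" width="918">
<text top="71" left="108" width="232" height="13" font="0">TUGboat, Volume 42 (2021), No. 1</text>
<text top="71" left="792" width="15" height="13" font="0">19</text>
</page>
</pdf2xml>"""

def pages_and_lines(xml_text):
    """Return {page number: [(top, [text, ...]), ...]}, top-down, left-to-right."""
    root = ET.fromstring(xml_text)
    result = {}
    for page in root.iter("page"):
        rows = defaultdict(list)
        for t in page.iter("text"):
            # itertext() also collects text inside nested elements such as <b>.
            rows[int(t.get("top"))].append((int(t.get("left")), "".join(t.itertext())))
        result[int(page.get("number"))] = [
            (top, [s for _, s in sorted(cells)]) for top, cells in sorted(rows.items())
        ]
    return result

for number, lines in pages_and_lines(SAMPLE).items():
    for top, texts in lines:
        print(number, top, texts)
```

For a real file you would probably need some tolerance when comparing
top values instead of requiring them to be equal.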

The page header starts at y-pos 71, its left edge is 108, and its
right edge is left + width = 575+232 = 792+15 = 807 on both pages.
The header has two parts:
. a constant text "TUGboat ..."
. a number
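
The edge arithmetic can be checked with a few lines of Python. The
(left, width) pairs are copied from the XML output; the helper name
extent is only my choice for this sketch.

```python
# (left, width) of the two header fragments on each page, from the XML output.
page1 = [(108, 15), (575, 232)]   # "18", "TUGboat, ..."
page2 = [(108, 232), (792, 15)]   # "TUGboat, ...", "19"

def extent(fragments):
    """Return (left edge, right edge) spanned by a list of (left, width) boxes."""
    return (min(l for l, _ in fragments), max(l + w for l, w in fragments))

print(extent(page1))  # (108, 807)
print(extent(page2))  # (108, 807)
```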

I call headers like this page invariants, and you can detect
these invariants because they have the same position and content
on every page.

After the invariants are detected, you can find the variant text
and possibly locate it in a latex source.
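
A minimal sketch of such a detection in Python, under my own
assumptions: text counts as invariant if the same (top, text) pair
occurs on every page, and the left position is ignored because the
constant text moves between left and right pages (575 versus 108
here). The names FRAGMENTS and split_header are made up for the
example.

```python
from collections import Counter

# (page, top, left, text) tuples, e.g. taken from the <text> elements.
FRAGMENTS = [
    (1, 71, 108, "18"),
    (1, 71, 575, "TUGboat, Volume 42 (2021), No. 1"),
    (1, 120, 107, "Typographers' Inn"),
    (2, 71, 108, "TUGboat, Volume 42 (2021), No. 1"),
    (2, 71, 792, "19"),
]

def split_header(fragments):
    """Return (invariants, variants).

    invariants: set of (top, text) pairs that occur on every page.
    variants:   {page: [text, ...]} for other text on the same line
                as an invariant.
    """
    pages = {p for p, _, _, _ in fragments}
    # Count on how many distinct pages each (top, text) pair occurs.
    per_page = {(p, top, text) for p, top, _, text in fragments}
    seen = Counter((top, text) for _, top, text in per_page)
    invariants = {key for key, n in seen.items() if n == len(pages)}
    header_tops = {top for top, _ in invariants}
    variants = {}
    for p, top, _, text in sorted(fragments):
        if top in header_tops and (top, text) not in invariants:
            variants.setdefault(p, []).append(text)
    return invariants, variants

invariants, variants = split_header(FRAGMENTS)
print(invariants)  # {(71, 'TUGboat, Volume 42 (2021), No. 1')}
print(variants)    # {1: ['18'], 2: ['19']}
```

With more pages, requiring the text on every single page is probably
too strict (title pages, figures), so a threshold such as "on most
pages" may work better.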

///

Example code:
 http://aspodata.se/git/openhw/pdftosym/pdfextr.pl

I have used that code to extract
. tables in ic datasheets, to find pin names and labels
. contents from pdf invoices

Unfortunately, since pdf files don't contain many semantic hints, you
have to add a subroutine for each specific group of cases.

///

There might be a better tool to extract things from pdf files, but I
forgot its name. It could give me data on the lines drawn in the pdf
file, which could make the prediction of table cell contents better.
Text direction is also missing from the pdftohtml output.

Regards,
/Karl Hammar