[texhax] Radical Philosophy html file ref needs correction

Carlos linguafalsa at gmail.com
Mon Apr 30 13:22:12 CEST 2018


On Mon, Apr 30, 2018 at 10:41:43AM +0100, Daniel Nemenyi wrote:
>Hello Carlos!
>
>Carlos writes:
>
>> I was just browsing the texhax archives. There was a thread about a
>> publication named Radical Philosophy.
>
>I'm the person behind Radical Philosophy's migration to LaTeX. Nice to
>see it discussed again on texhax :)
>

Hi Daniel.

>> There's a disclosure on the article's page that "The following text
>> has been automatically reproduced by an Optical Character Recognition
>> (OCR) algorithm. It may not have been checked over by human eyes. For
>> matters of precision please consult the original pdf."
>>
>> The article I'm referring to is at https://www.radicalphilosophy.com/article/a-monument-to-the-unknown-worker
>>
>> But even the source shows a
>>
>> <a href="#ref- -b" id="ref- -a" class="reflink body">[ ]</a>
>>
>> which throws off the footnotes. And don't we like seeing footnotes on
>> all TeX produced materials? Eh? hehe.
>
>Actually what happened was that we built a script to extract html from
>the original PDFs of our 45 year old archive, rather than recycling the
>html we had already made for some of them. So the old html from the
>archive site with the "#ref-" style footnotes was not the source.
>
>The PDFs of Radical Philosophy were created in InDesign from the 1990s
>until we moved to LaTeX, and before that by god knows what -- the early
>ones from the 1970s were typewriter, scissor and glue jobs. For this

Look at that. What a coincidence. I must have been
using scissors during those years too, and egg
whites at home for kindergarten projects, since
glue was scarce. Still, the output must have been
similar, right?

>migration we used pdftohtml to extract html from the InDesign PDFs, and
>pdftotext for the rest. We couldn't work out how to prevent the output
>of pdftohtml from being filled with noise and excessive html tags so we
>used... a lot of SED to clear things up! Probably should have taken a
>structured XML parsing approach, but anyway. I enjoy the Rubik's Cube
>quality of SED. And we added things like reference and waybackmachine
>links.
>
>The quality of it all... varies... some of it is fine, some of it is
>fine for a search engine but not a human. The pdftotext output

I'm not sure about the "fine for a search engine"
part either. It all depends on the search engine
and the tools it uses. Some engines prioritize
their own projects.

>especially isn't very good. Maybe we should re-ocr the whole lot and try
>again, but probably we'll correct things manually as we go along. So for
>all of these items we put up that warning.
>
>As for the new content produced by LaTeX, that of course converts really
>nicely into html. Via a wrapper on the admin panel of our WordPress
>site, we use pandoc to convert our tex files to html, and we use a bit
>of SED to standardise our tex files before submitting them to pandoc,
>since pandoc can be quite sensitive:
>
>exec("sed -i " . $filenametmp . " -e 's/\\\\includegraphics\[[htbH]*\]/\\\\includegraphics/g;' \
>                                  -e 's/\\\\begin{figure\*}\[[^]]*\]//g;' \
>                                  -e 's/\\\\end{figure\*}//g;' ");
>
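
For anyone reading along, the pre-pandoc step could
be sketched as a standalone shell function. This is
only a sketch under assumptions: the name
normalize_tex is hypothetical, the sample input is
made up, and I'm guessing the intent is to drop the
[htbH] placement options and the figure* wrappers
before pandoc sees the file.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the Radical Philosophy codebase.
# Reads LaTeX on stdin and writes the normalised text to stdout.
normalize_tex() {
  # 1. drop a placement option such as [h] or [htb] after \includegraphics
  # 2. drop \begin{figure*} together with its bracketed option
  # 3. drop \end{figure*}
  sed -e 's/\\includegraphics\[[htbH]*\]/\\includegraphics/g' \
      -e 's/\\begin{figure\*}\[[^]]*\]//g' \
      -e 's/\\end{figure\*}//g'
}

# Made-up sample input, only to show the effect.
printf '%s\n' '\begin{figure*}[t]' \
              '\includegraphics[H]{plot.pdf}' \
              '\end{figure*}' | normalize_tex
```

Note that [^]]* puts the closing bracket first inside
the negated set, which is how sed expects a literal ]
to be written there.
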
>And some more SED once it's out the other end:
>
>exec("sed -i " . $htmlfilenametmp . " -e 's/\[[ht]*\]//g;' \
>                                      -e 's/<hr \/>/<h2 class=\"notes\">Notes<\/h2>/g;' \
>                                      -e 's/<span>2<\/span>//g;' \
>                                      -e 's/<p><\/p>//g;' \
>                                      -e 's/<li id=\"fn/<li class=\"footnote\" id=\"fn/' \
>                                      -e 's/class=\"emoji\"/class=\"reflink reffoot\"/g;' \
>                                      -e 's/<a href=\"#fn/ <a href=\"#fn/g;' \
>                                      -e 's/↩/^/g;' \
>                                      -e 's/>^<\/a>/ class=\"reffoot footnoteLink\">^<\/a>/g' \
>                                      -e 's/><sup>/>/g;' \
>                                      -e 's/<\/sup></</g;' \
>                                      -e 's/class=\"footnoteRef\"/class=\"footnoteRef footnoteLink\"/g;' \
>                                      -e 's/<h2/<h3/g;'  \
>                                      -e 's/<\/h2>/<\/h3>/g;' \
>                                      -e 's/<img /<!--<img /g;' \
>                                      -e 's/ height=\"[0-9]*\"//g;' \
>                                      -e 's/ width=\"[0-9]*\"//g;' \
>                                      -e 's/ alt=\"image\" \/>/\/>-->/g;' \
>                                      -e 's/<p>[ ]*<!--/<!--/g;' \
>                                      -e 's/-->[ ]*<\/p>/-->/g;' \
>                                    ");
>

Look at that!

I wish I had thought of setting up something like
that for a personal blog of mine.

The markdown markup was lost at one point because
of careless handling by the devs. The project's
lack of documentation added to the problem, so I
ended up cleaning up afterwards, one inconsistency
at a time.
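
A subset of the post-pandoc cleanup quoted above
could be tried out on its own along these lines. It
is a rough sketch, not Daniel's actual wrapper: the
function name tidy_footnotes is hypothetical, and the
sample HTML is only my assumption of what pandoc's
footnote markup looked like at the time.

```shell
#!/bin/sh
# Hypothetical helper: applies a few of the quoted sed rules to HTML on stdin.
tidy_footnotes() {
  # - turn pandoc's <hr /> before the notes into a "Notes" heading
  # - unwrap <sup>...</sup> inside footnote links
  # - add the site's footnoteLink class to footnote references
  # - tag each footnote list item with class="footnote"
  sed -e 's/<hr \/>/<h2 class="notes">Notes<\/h2>/g' \
      -e 's/><sup>/>/g' \
      -e 's/<\/sup></</g' \
      -e 's/class="footnoteRef"/class="footnoteRef footnoteLink"/g' \
      -e 's/<li id="fn/<li class="footnote" id="fn/'
}

# Assumed sample of a pandoc footnote reference.
printf '%s\n' '<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>' \
  | tidy_footnotes
```

Running the rules on a small sample first would have
saved me a lot of one-at-a-time cleanup on my blog.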

>We're in the slow process of putting our codebases up on
>github... Anyway, don't know if any of that reply overkill will be
>useful to anyone out there.
>
>> Anyhow. Interesting article. Interesting. I enjoyed reading most of it.
>
>Glad you enjoyed /most/ of the article ;)
>

I sure did. I read Bolaño's Savage Detectives and
also some excerpts of his articles that appeared in a
literary magazine somewhere around Boston, I think.

A rare breed of writer, as someone once said.

He couldn't have hoped for a better translator
than Natasha Wimmer, I think is the name.

My personal opinion is that he's underrated and
will unfortunately always be in the shadow of
García Márquez, when he shouldn't be.

>Daniel
>
>>
>> Here's the git diff on the file
>>
>> --- a/bolano_original.html
>> +++ b/Bolano/bolano_add_ref.html
>> @@ -25,7 +25,8 @@ Argentine Ricardo Piglia) and postmagical realist (after, for example,
>>  the Paraguayan Augusto Roa Bastos) cognitive mapping. In doing so,
>>  <i>2666 </i>suggests, in a kind of high-modernist vein, an
>>  out-of-kilter realism re-presenting reality – that is, a capitalist
>> -world – gone awry.  Bolaño’s novel <i>2666</i> is an inorganic work
>> +world – gone awry. <a href="#fn2" id="fnref2" class="footnoteRef
>> +footnoteLink">[2]</a> Bolaño’s novel <i>2666</i> is an inorganic work
>>  written in five ‘parts’, a quintet that does not quite make a whole,
>>  and whose unity is given paradoxically in narrative proliferation and
>>  dispersal.  <a href="#fn3" id="fnref3" class="footnoteRef
>>
>>
>> Thanks a lot.
>> _______________________________________________
>> TeX FAQ: http://www.tex.ac.uk/faq
>> Mailing list archives: http://tug.org/pipermail/texhax/
>> More links: http://tug.org/begin.html
>>
>> Automated subscription management: http://tug.org/mailman/listinfo/texhax
>> Human mailing list managers: postmaster at tug.org

