Chemical structures with plain TeX

Peter Flynn peter at silmaril.ie
Sat Jul 6 01:35:57 CEST 2019


On 05/07/2019 19:38, Shreevatsa R wrote:
> On Fri, 5 Jul 2019 at 03:47, Taylor, P <P.Taylor at rhul.ac.uk> wrote:
> 
>     As I wrote off-list to Peter :

Which has inexplicably failed to arrive so far.

>> Sometimes I just want to weep.  

I have the deepest sympathy with this...I feel exactly the same when
confronted by file formats or command structures I don't understand,
such as Python :-)

I should have made it more obvious, but I was short of time.

>> There can be no doubt, based even 
>> on just the evidence above, that the Unix operating system is a 
>> very powerful tool, 

Actually all that can be done almost identically in Windows, I think,
although I have never personally had any success in piping in Windows
except for the most trivial commands. Maybe in PowerShell.

>> and the simple fact that one can identify all packages that do not
>> have the string "LaTeX" (presumably case-insensitive) in their CTAN
>> path is a clear demonstration of that fact.

More a clear demonstration that the data is robustly constructed: that
is, Karl and others take good care to ensure that the same class of
information (here the directory path) is in the same place for every
entry, and that it is enclosed in the same HTML elements every time.

Without that, all bets are off.

>> And yet the entire thing is gibberish. 

Pretty much. It's a set of conventions, but it's also a set of programs
which I use daily, often many times daily, in the course of information
retrieval: download something, extract some data, use that to download
something else, extract more data from that, reorganise or rearrange it,
and use the result to do something else. So I'm familiar with them, and
it's not reasonable to expect anyone else to be unless they're also in
the same field of information extraction.

>> It could be Mayan, for all I know.  I could stare at it for the
>> rest of my life and still not have the slightest idea how it works.

Without the manual pages to explain what the hieroglyphics mean, anyone
would be lost. Fortunately the manual pages come with the programs, so
they're built into any computer the programs are installed on.

>> Why oh why oh why does someone not come up with a command-line
>> interpreter (or as I fear you would call it, "a shell") that uses
>> English verbs as its commands and English
>> nouns/adjectives/adverbs/etc as its qualifiers?

They have done, many, many times, and continue to do so. As Shreevatsa R
explained, all the short command options have long versions, which are
all in English.
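
For example, with GNU grep and wget, these pairs are identical:

   grep -v -i latex
   grep --invert-match --ignore-case latex

   wget -q -O - 'https://ctan.org/topic/chemistry'
   wget --quiet --output-document=- 'https://ctan.org/topic/chemistry'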

>> How on earth is anyone expected to know what "-i -o" implies,
>> especially as what it implies is almost certainly a function of the
>> command to which it is applied ?  

Yes, exactly. I'm afraid it means reading the manual.

If you type 'man tidy' to view the Tidy manual, among all the other
options it says:

       -numeric, -n (numeric-entities: yes)
              output numeric rather than named entities

which makes Tidy output (for example) &#x2108; for the ℈ sign instead of
&scruple; which is what it means (I am dealing with an 11th-century
medical text at the moment). That particular option makes the resulting
document plain ASCII text, rather than the normal UTF-8, which makes it
easier to handle in some non-XML utilities which still (!) stumble over
multi-byte UTF-8 (three- and four-byte encodings of a codepoint).
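
So the whole conversion is a one-liner, sketched here with a
hypothetical filename:

   # -asxml: emit well-formed XHTML; -ascii: entity-escape anything
   # outside ASCII; -quiet: suppress the summary chatter
   tidy -asxml -numeric -ascii -quiet page.html > page.xml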

>> And why can one not apply 2>/dev/null distributively, such that it
>> applies to /all///commands in the sequence rather than having to be
>> spelled out in full for each.

That's a very good architectural question. It is certainly possible: you
can redirect stderr to /dev/null with a command which I won't give here
because (a) it's unbelievably rare that anyone would ever want to do
such a thing in the wild, and (b) it's spectacularly dangerous for many
reasons that I won't go into.

It *is* a valid thing to do in a script, where its effect will cease
when the script stops executing, much like a LaTeX variable reverting to
its previous value outside the group where it was re-set.

In this case, however, I did *not* want it redirected for lxprintf and
grep, only for the wget and tidy commands.
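
In outline it was shaped like this (a sketch, not the exact command;
the lxprintf arguments are elided):

   wget -q -O - 'https://ctan.org/topic/chemistry' 2>/dev/null \
     | tidy -asxml -numeric -quiet 2>/dev/null \
     | lxprintf ... \
     | grep -i -v latex

with each 2>/dev/null silencing only the command it is attached to.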

> I think very few people prefer to write the longer versions though.

I'm as lazy as the next person, so yes, guilty as charged: I use the
short versions because I can't remember the long ones, and I'm usually
short of time.

> Also, "2>/dev/null" could be applied distributively, by enclosing the
> whole thing in parentheses and appending "2>/dev/null" to that.
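
Indeed: parenthesising the pipeline runs it in a subshell, and a single
redirection after the closing parenthesis covers every command inside:

   ( wget ... | tidy ... | lxprintf ... | grep ... ) 2>/dev/null
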
> 
> Anyway, here is a python3 script that I guess (because I couldn't
> install lxprintf either) 

It is a source distribution in GNU format, so you have to unzip/untar
the code, and run configure, make, and make install.
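
That is the usual GNU dance, something like this (version and filename
hypothetical):

   tar -xzf lxprintf-x.y.tar.gz   # unpack the source
   cd lxprintf-x.y
   ./configure                    # inspect the system, write Makefiles
   make                           # compile
   sudo make install              # install, usually under /usr/local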

> is the equivalent of the above; hopefully it is slightly easier to
> understand:

I have the same problem with this as Phil had with my commands :-) and I
would have to guess that BeautifulSoup is some kind of package that
handles HTML parsing, possibly as XML (I don't know).

> import requests
> from bs4 import BeautifulSoup
> 
> chemistry_response = requests.get('https://ctan.org/topic/chemistry')
> chemistry_soup = BeautifulSoup(chemistry_response.text, 'html.parser')
> for link in chemistry_soup.find_all('a'):
>     href = link.get('href')
>     if href and href.startswith('/pkg/'):  # an <a> may have no href
>         uri = 'https://ctan.org' + href
>         package = BeautifulSoup(requests.get(uri).text, 'html.parser')
>         for td in package.find_all('td'):
>             if td.text == 'Sources':
>                 path = td.find_next_sibling('td').a.code.text
>                 if 'latex' not in path:
>                     print(path)

It also shows up a bug in what I wrote: the XPath predicate in the
final lxprintf command should be following-sibling::td[1]/a/code,
because it's the *first immediately following* <td> we're interested
in; not any or all of them, just the first one.
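
For anyone without lxprintf, the corrected expression can be tested
with xmlstarlet instead (the h: binding assumes Tidy's XHTML output,
which puts everything in the XHTML namespace):

   xmlstarlet sel -N h='http://www.w3.org/1999/xhtml' -t \
     -v '//h:td[.="Sources"]/following-sibling::h:td[1]/h:a/h:code' \
     page.xml
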
> Of course all this does is replace Unix programs written by different
> people with Python packages (libraries) written by different people; so
> it may not be any better.

It's a better quality of script than what I bashed out at the terminal,
and it's not really much longer. It assumes Python skillz; mine
assumes Unix skillz. You could do the same in lots of languages, but the
key reason I use Tidy is that I am *guaranteed* to get an XML file to
work with, so (because of the reliability I mentioned earlier) I then
*know* that the data I want will be in the place I specify, something
impossible to know if you process HTML in the slapdash way it is usually
written (CTAN is an exception). The XML Guarantee does two things here:

a) it guarantees that if a document is broken, it will fail the parse,
and won't go blindly charging ahead, making unwarranted assertions about
where stuff is supposed to be, based on what may now be wholly corrupt
data. Only well-formed documents are presented for processing, so...

b) *any* file which passes the test can be processed by *any*
XML-conformant software. So I can hand the file to an extractor like
lxprintf, or to an XML editor for human intervention, or to an XSLT
process for conversion to something else (e.g. LaTeX), or any of hundreds
of other things one might want to do.
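
Test (a) is a one-liner with libxml2's xmllint, for example:

   xmllint --noout file.xml && echo "well-formed"

which prints the confirmation if the parse succeeds, or diagnostics and
a non-zero exit status if it does not.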

Like I said, I'm lazy. I don't want malformed data, and this is just a
part of the filtering apparatus I use.

It says nothing about the quality of the *data content*, of course (a
human problem), but at least if there are errors, they're traceable.

Peter


