[tex-live] catalogue metadata usage in texdoc (was: texdoc index)

Fri Dec 30 02:57:34 CET 2011

On 2011-12-28 at 04:09:52 +0100, Joachim Schrod wrote:

 > Reinhard Kotucha wrote:
 > 
 > Reinhard, sorry for intruding in your thread... ;-)

You are welcome.  It's not _my_ thread. :)

 > >   texdoc debug-score: Start scoring /usr/local/texlive/2011/texmf-dist/doc/generic/vntex/vntex.pdf
 > >   texdoc debug-score: Catalogue details bonus: +1.5
 > 
 > Is there any real-world data when catalogue details are wrong?
 > (Not: wrong as in not available. But: Wrong as in Available but
 > incorrect.)
 > Why is the bonus just +1.5 and not more?
 > 
 > Tonight, I had a CTAN work dinner and we had exactly that
 > discussion. I'm on record (there have been witnesses :-) to make
 > the bold statement "Robin is just too good. The catalogue meta-data
 > concerning available documentation can be relied upon. In practice,
 > it doesn't produce false positives." Am I wrong?

I must admit that I don't have an overview at all.  Neither about the
catalogue nor about the documentation on CTAN or in TeX Live.

The nasty thing is that it's not sufficient to investigate things by
scripts.  It's often necessary to look into the files in order to see
what they contain, which is time-consuming.  I suppose that Robin can
tell you more about the current state of the catalogue.

As far as texdoc is concerned, relying entirely on the catalogue is
probably a bit problematic because texdoc doesn't use the catalogue
directly.  The catalogue is consulted when a TeX Live package is
created and information is passed from the catalogue to the TeX Live
database.

This means that if Robin fixes a bug in the catalogue, the fix only
takes effect when the package is updated.  Users then have to download
the whole package again even though nothing in the package itself has
changed.

There is one thing which does not allow texdoc to rely on the
catalogue entirely:  There are sometimes several files in a package
which have the "Package manual" tag but texdoc still has to decide
which file is the most appropriate one.

Look at entries/v/vntex.xml, for example.  There are two files,
vntex.pdf and vntex-man.pdf.  Both have the tag "Package manual".
texdoc still has to guess from the file name which one is more
appropriate.
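
To make the ambiguity concrete, here is a minimal sketch in Python (not
texdoc's actual Lua code).  Only the +1.5 catalogue bonus is taken from
the debug output quoted above; the exact-name bonus and the function
names are invented for illustration.

```python
import os

# Only CATALOGUE_BONUS comes from the quoted debug output.
# EXACT_NAME_BONUS is a hypothetical weight, chosen for illustration.
CATALOGUE_BONUS = 1.5
EXACT_NAME_BONUS = 1.0


def score(filename, package, tagged_files):
    """Score one documentation file for a package."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    total = 0.0
    if filename in tagged_files:
        # The catalogue marks this file as a "Package manual".
        total += CATALOGUE_BONUS
    if stem == package:
        # File-name guess: the stem equals the package name.
        total += EXACT_NAME_BONUS
    return total


def best_doc(package, candidates, tagged_files):
    """Return the highest-scoring candidate."""
    return max(candidates, key=lambda f: score(f, package, tagged_files))


candidates = ["vntex.pdf", "vntex-man.pdf"]
tagged = set(candidates)  # both carry the "Package manual" tag
print(best_doc("vntex", candidates, tagged))
```

Since both files get the same catalogue bonus, the tag alone cannot
separate them; in this sketch only the file-name guess does.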

If texdoc could rely entirely on the catalogue, that would be great.
But there are a few things which have to be done first.  What's still
missing is a clear specification.  For instance, we need tags which
denote which file should be displayed by texdoc.  Sure, there can be
only one such tag in a particular package, except if one file is an
exact translation of another one.  texdoc could then make use of the
language tags and maybe select the file according to the current locale
setting.  At the moment I don't see any possibility to support
locales.
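
If such tags existed, the selection itself could be simple.  A rough
sketch in Python, assuming each primary manual carries a language code;
the file names, the tag layout and the fallback to English are all
assumptions, nothing of this exists in the catalogue or in texdoc today.

```python
def pick_by_locale(docs, locale_name, default_lang="en"):
    """Pick the primary manual that matches the locale.

    docs is a list of (filename, language) pairs, one per translation
    of the file that the (hypothetical) tag marks as the primary manual.
    """
    # "de_DE.UTF-8" -> "de"; "C" or an empty value match nothing below.
    lang = (locale_name or "").split("_")[0].split(".")[0].lower()
    by_lang = {language: name for name, language in docs}
    if lang in by_lang:
        return by_lang[lang]
    # Fall back to the default language if there is no translation.
    return by_lang.get(default_lang)


# Hypothetical file names, for illustration only.
docs = [("manual.pdf", "en"), ("manual-de.pdf", "de")]
print(pick_by_locale(docs, "de_DE.UTF-8"))
print(pick_by_locale(docs, "C"))
```

The locale string would come from the environment (LANG or LC_ALL, for
instance); it is passed in explicitly here to keep the sketch
self-contained.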

Anyway, IMO it would be shameless to ask Robin to add those tags, and
of course, it's even more shameless to ask him before a clear concept
exists.

If texdoc is supposed to rely on the catalogue entirely, a few things
have to be done.  First we need clear specifications.  Maybe we need a
small working group.  Then we have to think about CTAN itself.  Since
Jim left the team, the situation is more than unfortunate and we can't
expect that Robin does more than he already does.

I'm wondering how Robin's load can be lightened, but I fear that the most
time-consuming part is communication with package authors.  As far as
XML tags are concerned, the CTAN web interface could ask authors
which file in the doc tree is the one supposed to be found by texdoc
in the first place when they upload a package.  But this isn't a
solution either due to all the old packages which never get updated.
Thus, a lot of manual work is required.

Finding new CTAN maintainers is difficult because an enormous amount
of knowledge is required.  But the catalogue is just a subset of CTAN
and probably (hopefully) it might be a bit easier to find volunteers.

Assume that texdoc eventually will be able to rely entirely on the
catalogue (let me dream a little bit).

The catalogue data can be converted from XML to pre-compiled Lua data
structures on the TUG server.  Lua's basic data type is an associative
array, thus it's quite natural to represent XML data in Lua.  Of
course, the catalogue is huge, but a pre-compiled Lua file is loaded
30 times faster than pure Lua (ASCII) code.  If we can rely entirely
on the catalogue, then texdoc will be extremely fast.

Currently texdoc mainly depends on pattern matching and regular
expressions, which is much less efficient than loading a pre-compiled
data structure (in TeX parlance: a format file).
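
The convert-once, load-many idea can be sketched as follows.  This is a
Python analogue, not the Lua implementation: pickle stands in for a
pre-compiled Lua chunk (which luac could produce), and the XML below is
a simplified stand-in for a catalogue entry; the real schema may differ.

```python
import pickle
import xml.etree.ElementTree as ET

# Simplified stand-in for entries/v/vntex.xml; element and attribute
# names are assumptions made for this sketch.
ENTRY_XML = """
<entry id="vntex">
  <documentation details="Package manual" href="vntex.pdf"/>
  <documentation details="Package manual" href="vntex-man.pdf"/>
</entry>
"""


def entry_to_dict(xml_text):
    """Turn one catalogue entry into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "docs": [
            {"details": d.get("details"), "href": d.get("href")}
            for d in root.findall("documentation")
        ],
    }


# Server side: parse the XML once and serialize the resulting table.
table = {"vntex": entry_to_dict(ENTRY_XML)}
blob = pickle.dumps(table)

# Client side: load the ready-made table; no XML parsing and no
# pattern matching at lookup time.
loaded = pickle.loads(blob)
print(loaded["vntex"]["docs"][0]["href"])
```

A lookup then becomes a single table access by package name.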

Anyway, I refuse to bother Robin before we have clear specifications.
Maybe it's necessary to establish a dedicated working group.  And we
also have to make sure that everything we do doesn't cause extra work
for Robin.  He already spends all his spare time on package updates,
it would be shameless to expect more.

Regards,
  Reinhard

-- 
----------------------------------------------------------------------------
Reinhard Kotucha                                      Phone: +49-511-3373112
Marschnerstr. 25
D-30167 Hannover                              mailto:reinhard.kotucha at web.de
----------------------------------------------------------------------------
Microsoft isn't the answer. Microsoft is the question, and the answer is NO.
----------------------------------------------------------------------------