[tex-live] Better ways to find packages and documentation [was: texdoc in luatex]

Florent Rougon f.rougon at free.fr
Tue Jul 3 16:52:47 CEST 2007


Hi,

I took a little bit of time to answer, because I wanted to look at the
new TL infrastructure first (and also because I'm trying to have a life
:).

Norbert Preining <preining at logic.at> wrote:

> puhhh, many things are floating around. In fact it would be nice to
> discuss all this stuff in person, which would make it much easier!

Do you want to visit Paris? :)

> Well, this should work already now more or less, at least for those
> packages where we know the TeX Live name <-> TeX Catalogue name
> translation. Let's put it this way, the Catalogue has a field
> 	<texlive location='foobar'>

hyperref.xml has:

  <texlive/>
  <miktex location='hyperref'/>
  <tetex/>

i.e., the texlive element is empty. Huh? Maybe because it's in texmf and
not texmf-dist?...

Anyway, it contains only one path component, what can I do with that?...

(not sure it matters much for our purpose, so feel free not to answer)

> which gives the respective TLPOBJ (ex tpm) to which it corresponds.

IMHO, this is not precise enough because the link here is at (CTAN/TL)
package level, not file level. However, the doc files can have different
names depending on the format chosen, and I want to be able to carry the
metadata from CTAN about these files even when CTAN and TL have chosen a
different format/file name.

For instance, if CTAN ships some doc in HTML format and TL the same doc
in PDF format, we could have in the catalogue:

  ...
  <documentation details='blah blah' language='de'
                 href='ctan:/macros/.../foo/index.de.html'>
  ...

and you would have in your current texlive.tlpdb:

  docfiles size=******
   ...
   texmf-dist/doc/latex/.../foo-de.pdf
   ...

When I read this texlive.tlpdb excerpt, it is not at all obvious (for a
program) that foo-de.pdf has the attribute language='de'.

One way out of this is to specify an id for each document, as in:

  <documentation id='foo-user-guide' details='blah blah' language='de'
                 href='ctan:/macros/.../foo/index.de.html'>

and for texlive.tlpdb:

  docfiles size=394380
   ...
   foo-user-guide texmf-dist/doc/latex/.../foo-de.pdf
   ...

Then, if I have the catalogue XML files somewhere for my program to
read, I can trace foo-de.pdf to the corresponding <documentation>
element in the Catalogue and deduce that it has language='de'.
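
To make this concrete, here is a rough sketch in Python. It assumes the
proposal above, i.e. an 'id' attribute on <documentation> and an id
prefix on each docfiles line; neither exists in the Catalogue or in
texlive.tlpdb today. The <entry> wrapper and the function names are
made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical Catalogue excerpt: the 'id' attribute is the proposal,
# not an existing Catalogue feature. The elided path is kept as-is.
CATALOGUE_XML = """\
<entry id='foo'>
  <documentation id='foo-user-guide' details='blah blah' language='de'
                 href='ctan:/macros/.../foo/index.de.html'/>
</entry>
"""

# Hypothetical texlive.tlpdb excerpt with the proposed id prefix.
DOCFILES = """\
docfiles size=394380
 foo-user-guide texmf-dist/doc/latex/.../foo-de.pdf
"""


def doc_ids_by_path(docfiles_text):
    """Map each installed doc path to its document id."""
    mapping = {}
    for line in docfiles_text.splitlines():
        if not line.startswith(" "):
            continue  # skip the 'docfiles size=...' header line
        doc_id, path = line.split(None, 1)
        mapping[path] = doc_id
    return mapping


def metadata_by_id(catalogue_xml):
    """Map each document id to the attributes of its <documentation>."""
    root = ET.fromstring(catalogue_xml)
    return {doc.get("id"): dict(doc.attrib)
            for doc in root.iter("documentation")}


def language_of(path):
    """Trace an installed doc file back to its Catalogue language."""
    doc_id = doc_ids_by_path(DOCFILES)[path]
    return metadata_by_id(CATALOGUE_XML)[doc_id]["language"]


print(language_of("texmf-dist/doc/latex/.../foo-de.pdf"))
```

The point is only that the id is the join key: the file name and format
can differ freely between CTAN and TL.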

> Unfortunately we don't have a back mapping built into TeX Live, or at

[from TLPOBJ name to CTAN package name]

Huh?

I thought that in texlive.tlpdb:
  - either the 'name' field indicates the CTAN package name;
  - or there is a 'catalogue' field indicating the CTAN package name.
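
If that reading is right, the back mapping is a few lines. A sketch in
Python, assuming stanzas separated by blank lines, one "key value" field
per line, and indented lines for file lists; the two sample packages are
invented:

```python
# Invented sample data in what I understand the tlpdb layout to be.
SAMPLE = """\
name foo
category Package
docfiles size=1
 texmf-dist/doc/latex/foo/foo.pdf

name bar-tl
category Package
catalogue bar
"""


def parse_stanzas(text):
    """Return one dict of top-level fields per package stanza."""
    packages = []
    for stanza in text.strip().split("\n\n"):
        fields = {}
        for line in stanza.splitlines():
            if line.startswith(" ") or not line.strip():
                continue  # indented lines are file lists, not fields
            key, _, value = line.partition(" ")
            fields.setdefault(key, value)
        packages.append(fields)
    return packages


def ctan_name(fields):
    """Use the 'catalogue' field if present, else fall back to 'name'."""
    return fields.get("catalogue", fields["name"])


print([ctan_name(p) for p in parse_stanzas(SAMPLE)])
```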

> least only partly into the ctan2tl script. All this could be simplified
> in some way (but who? is writing all the code!).

FWIW, I couldn't find ctan2tl by browsing the TL repository through the
web interface. Where is it?

> So in principle we can do:
> 	CTAN package -> get texlive location ->
> 	-> get TLPOBJ from texlive.tlpdb -> get docfiles from this
> I can write you a perl script in 3 minutes that does this. (spits out
> list of files, nothing else)

Yup, easy, but I want the Catalogue metadata carried with each doc file.

> Florent, can you send me what you want? Ie format of the output:
> Something like
> 	$ get-florent-stuff "ctan package name"
> 	...
> 	format of return stuff to be specified
> 	...

I don't think this is the best interface, because:
  1. In this case, I need to have two lists:
      * all CTAN packages
      * all CTAN packages installed on the system

     otherwise I won't be able to show the user:
      * what can be found on CTAN
      * what is already at his disposal

  2. This will be slow if invoked for each CTAN package (at least one
     process each time...). Of course, this is not a big problem if you
     intend to run this only at TL installation to generate a big file
     containing all the info for CTAN packages in TL. My program could
     then use this big file.

I'm afraid I cannot give you the full spec right now, because there are
still design questions that need answers. The data layouts I can imagine
are greatly dependent on these answers.

1) Can a given CTAN package be split among several TEXMF trees (in TL,
   in MiKTeX, etc.)? Or rather, do we want to support that?

   From a quick look at texlive.tlpdb, I have the impression that the
   answer is "no". For instance, geometry's documentation is not in
   texmf-doc, but in texmf-dist with all other files from the geometry
   package.

2) Do you want the data for TL-available packages to be split into
   individual files for each package, or gathered into big files?

   Advantage for the split version: it's easier to register/unregister a
   package by distributors: just add or remove the corresponding files.

   Disadvantage: takes more space, clutters the filesystem.

3) Do you want to reduce data redundancy as much as possible?

   I have the impression your answer is yes, in which case I'll need a
   copy of the Catalogue somewhere on the filesystem to lookup the
   metadata for each CTAN package and for each documentation file.

   If the answer to question 1 is no, we can split the Catalogue
   among the various TEXMF trees: each TEXMF tree would have the part of
   the Catalogue corresponding to packages installed in that tree.

   If the answer is yes, then I need one copy of the Catalogue somewhere
   and I have to make it easy for third parties (users, distributors) to
   extend its data when they install a package that is not referenced in
   the copy of the Catalogue they have on disk (can usually be done by
   dropping files in a directory).

4) Do you want to edit files in-place when a package is added or
   removed?

   (similar to question 2, but not for available TL packages, rather to
   tell whether a given package is installed or not).

   My tool needs to be able to tell whether a package is installed or
   not. This way, the user can choose to either browse the whole TL
   distribution (when looking for some new package to install), or only
   what he has installed so far (e.g., when looking for the
   documentation of a package he knows he has installed).

   There are several ways to embed this information. It can be an
   "installed=yes/no" attribute in an XML file, or an

     Installed: yes/no
  
   line in an RFC-2822-style file, or it can be done by dropping or
   removing a file in a known directory with the CTAN package name as
   basename.

   The first two ways are compact but a bit cumbersome for installers
   (TL package installer, Debian package maintainer scripts); the third
   way creates a lot of ridiculous files, but is very easy to handle for
   installers.
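
   To show how little an installer has to do with the file-dropping
   variant, here is a sketch in Python. The marker directory and the
   function names are hypothetical; a temporary directory stands in for
   the "known directory".

```python
import tempfile
from pathlib import Path


def register(marker_dir, ctan_name):
    """Mark a package as installed by creating an empty marker file."""
    Path(marker_dir, ctan_name).touch()


def unregister(marker_dir, ctan_name):
    """Mark a package as removed by deleting its marker file."""
    Path(marker_dir, ctan_name).unlink()


def installed_packages(marker_dir):
    """List installed packages: one marker file per CTAN package name."""
    return sorted(p.name for p in Path(marker_dir).iterdir() if p.is_file())


# A temporary directory stands in for the real marker directory.
with tempfile.TemporaryDirectory() as d:
    register(d, "geometry")
    register(d, "hyperref")
    unregister(d, "geometry")
    result = installed_packages(d)

print(result)
```

   No file has to be parsed or edited in place; the price is one tiny
   file per installed package.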

5) Do we want to be able to tag individual documentation files, or only
   CTAN packages?

   Tagging individual doc files is precise but more complex for the DTD
   used in the Catalogue and for my tool. One would have tags both under
   <entry> elements (these would be package tags) and under
   <documentation> elements (tags for individual documentation files).
   Maybe this would also call for two vocabularies (where a "vocabulary"
   is a set of legal tags in a given context): one for packages and one
   for doc files.

   It is quite possible that we don't need to go so far as tagging
   individual doc files:

     - for ordinary packages, I believe it's enough to have the tags
       guide the user to the right package, from which point he can get
       a (usually short) list of all doc files and choose which one to
       open.

     - for documentation that is not really tied to a LaTeX package,
       such as general tutorials, things such as clsguide, etc., I have
       the impression that there is always a CTAN package containing the
       documentation.

       If the CTAN package contains only one document (or several
       documents, but all about the same subject), then tagging the
       package is enough---no need to tag each doc file.

       But in case the CTAN package contains a mishmash of various
       documents, then tagging only at CTAN package level may be too
       imprecise. Sure, the package can get the union of all tags that
       would apply to each document... but this is imprecise.

       Either we accept that (because such packages are rare, or because
       choosing from a list of 10 documents is deemed acceptable), or we
       don't. If we don't, there are two possibilities:
          - tag individual documents, not only CTAN packages (see above);
          - or split such CTAN packages so that each CTAN package is
            specific enough for its tags to be relevant.
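
To illustrate the loss of precision, a small Python sketch with an
invented package holding two unrelated documents (file names and tags
are made up):

```python
# Invented per-document tags for one "mixed" CTAN package.
DOC_TAGS = {
    "tutorial.pdf": {"beginner", "tutorial"},
    "fonts-howto.pdf": {"fonts", "howto"},
}


def package_tags(doc_tags):
    """Package-level tagging: the union of the tags of every document."""
    return set().union(*doc_tags.values())


def docs_matching(doc_tags, tag):
    """Document-level tagging: only the documents carrying the tag."""
    return sorted(name for name, tags in doc_tags.items() if tag in tags)


# With package-level tags, a search for "fonts" finds the package, and
# the user must then pick among all of its documents.
print("fonts" in package_tags(DOC_TAGS), sorted(DOC_TAGS))

# With document-level tags, the same search points at a single file.
print(docs_matching(DOC_TAGS, "fonts"))
```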

With an answer to all these questions, I should be able to propose a
relatively precise specification. Unless I forgot questions. :-P

(well, there is another question as I see from the rest of your mail: do
you prefer XML or RFC-2822 format? I saw you have some grief about XML,
but for structured data, it is far superior to RFC-2822, so in some
cases, there is no point asking the question; for simple stuff, yes, I
can consider RFC-2822. XML is also very good when it comes to encodings.
Keep this in mind :).

>> A lot of work, for sure...
>
> And in need of a *good* programmer like you to help a bit ;-))))

Sorry, I am genetically unsuited to work in Perl. ;-)

> Well, we discussed this for quite some time when we rewrote the TeX Live
> infrastructure. We had this TPM file, and of course those could be
> enriched with other information. For various reasons we wanted to
> separate real content from generated content, and NOT to have 2000+
> separate files (the installers had problems because they needed to read
> all those files).

You mean, there will be an unacceptable performance hit if anything in
this design causes a program to read one file per CTAN package? Because
of DVD head movements and things like that?

>                    So the new infra has package source files providing
> the absolute minimum on information, and from this on can generate
> package object files containing: metadata like installation
> instructions, descriptions (currently missing, should be done from
> Catalogue), list of files in 4 categories (run, bin, src, doc) and the
> respective sizes. All these textual representations are concatenated to
> the texlive.tlpdb.

What I was thinking about at first was something quite similar to
texlive.tlpdb, but with the metadata for each package and doc file. Such
a file would be generated once at TL installation and also whenever a
package is installed or removed (unless it doesn't contain the
"installed" status, in which case it need only be generated at TL
installation).

But as explained above, we can also reduce data redundancy by having one
copy of the Catalogue in some known place on the filesystem (and provide
support for extending this data by administrators and distributors).

If you choose to have a copy of the Catalogue, and if having one file
per package as discussed just above is unacceptable, then the Catalogue
has to be compiled into one big XML file before being installed in the
filesystem.
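
Such a compilation step is cheap to write. A sketch in Python, assuming
one XML file per package with an <entry> root element; the <catalogue>
wrapper element and the function name are my invention, and a temporary
directory with two dummy entries stands in for the real Catalogue:

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def compile_catalogue(entries_dir):
    """Merge every per-package XML file into a single element tree."""
    root = ET.Element("catalogue")  # wrapper element name is an assumption
    for path in sorted(Path(entries_dir).glob("*.xml")):
        root.append(ET.parse(path).getroot())
    return ET.ElementTree(root)


# Two dummy entries stand in for the per-package Catalogue files.
with tempfile.TemporaryDirectory() as d:
    Path(d, "foo.xml").write_text("<entry id='foo'/>", encoding="utf-8")
    Path(d, "bar.xml").write_text("<entry id='bar'/>", encoding="utf-8")
    merged = compile_catalogue(d)
    ids = [e.get("id") for e in merged.getroot().iter("entry")]

print(ids)
```

A program then reads one file instead of one file per package, which is
the case the installers had trouble with.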

> Optimal would be that these files can be *generated* from what is there,
> so the Catalogue and the installed files.

Parse error, but I believe this should be addressed by all of the above,
right?

> Example: Before we had in the tpm files:
> - list of files      generated from the svn repository with some scripts
> - descriptions       now and then updated from the catalogue or by hand
> - licenses           now and then updated from the catalogue
> - patterns for files        manually maintained
> - installation instruction: manually maintained
> - ...
> Now all the stuff that should be updated was always out of date, wrong,
> conflicting etc.

The "big files" I was talking about were not supposed to be stored in
the TL repository, but rather generated by the TL installer (or Debian
scripts). In this case, they cannot be out-of-date with respect to what
is installed.

> If we now create another place which can become out of date this is
> counter productive. Therefore I proposed to somehow include the
> information in the catalogue.

I have nothing against putting the info in the Catalogue, but even with
some data redundancy (a big file containing the metadata from the
Catalogue + the install paths for each doc file as given by the TL
installer), there is a way to make sure the data is never out-of-date.
This way is: have the file generated by the TL installer.

But, as explained, it is also possible to have a copy of the Catalogue
in some known place and additional files containing only new data, and
make the link between both. In this case, we need to be able to link by
CTAN package name and by document id (in order to get the metadata such
as the language for each specific document).

> So since I hate long discussions and like to come up with solutions,
> what do we actually have to do to get these things working, let's call
> it a work plan:
>
> This is preliminary, please extend it with your local knowledge, ie, if
> you have suggestions for CTAN, add it, or for Catalogue, or whatever.

It's OK, but a bit vague. We are in the data structures right now. :)

Regards,

-- 
Florent

