[tex-live] [LONG] Improving TeX package classification and the associated documentation

Mon Jul 2 00:12:24 CEST 2007

Hi,

karl at freefriends.org (Karl Berry) wrote:

> I see no hugely critical reason to support compressed pdf files.  I
> don't think we should have them in TeX Live.  The amount of space saved
> is not worth the ensuing trouble IMHO.

Yes, I think there is now a fair consensus that supporting externally
compressed PDF files isn't worth the trouble. It seems we're better off
waiting for PDF 1.5 to spread and maybe improving pdfTeX and other tools
for more efficient font embedding.

> If you want to rewrite texdoctk to somehow make use of this, perhaps
> using the Catalogue data as processed for TL by David's new Elisp (which
> I still haven't tried yet, sadly), go for it.  I certainly agree that
> texdoctk.dat is not a good solution.

David's script is a simple way to have a usable offline catalogue *now*,
but for a good rewrite of texdoctk, I'd like something less heuristic,
that deals correctly in a deterministic way with files named manual.pdf,
be they from hyperref or some other package.

This does require integration with the TL installation procedure, but
AFAIU, this is the right moment and I have no problem working with
Norbert.

My current view of the Right Way would be the following:

  Each package (in the sense of CTAN package, not Debian) contains an
  XML file that specifies the following:

    - what is the package useful for; I think the debtags approach would
      be great for that, see below.

    - what documentation files are provided in the package, with at a
      minimum:

        * relative paths from the root of the package

        * the language each file is written in

        * a short description of each file (in most cases, the document
          title should be OK).

      One might want to tag the individual doc files with the debtags
      approach, but I don't think we need so much granularity. I think
      it is sufficient to put the tags at the CTAN package level.
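
  To make this a bit more concrete, here is a rough Python sketch of
  what reading such a metadata file could look like. The XML layout
  (element and attribute names), the tag "subject::hyperlinks" and the
  file paths are all made up for illustration; nothing here is an
  agreed-upon format.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata file: the element names, attribute names, tags
# and paths below are assumptions, not an existing CTAN format.
SAMPLE = """\
<package name="hyperref">
  <tags>
    <tag>subject::hyperlinks</tag>
    <tag>macropackage::latex</tag>
  </tags>
  <doc path="doc/manual.pdf" lang="en">Hyperref manual</doc>
  <doc path="doc/README" lang="en">README</doc>
</package>
"""


def parse_metadata(xml_text):
    """Return (package name, set of tags, list of documentation entries)."""
    root = ET.fromstring(xml_text)
    tags = {t.text.strip() for t in root.findall("./tags/tag")}
    docs = [
        {
            "path": d.get("path"),          # relative to the package root
            "lang": d.get("lang"),          # language the file is written in
            "description": (d.text or "").strip(),
        }
        for d in root.findall("doc")
    ]
    return root.get("name"), tags, docs


name, tags, docs = parse_metadata(SAMPLE)
print(name, sorted(tags))
for doc in docs:
    print("%(path)s [%(lang)s]: %(description)s" % doc)
```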

  Then, I need some information regarding where the files are installed
  in TL. I suppose the TL installation infrastructure could easily
  provide me with this.

  A simple way to do that would be to have my program (let's call it
  texcatalogue for now) find this information ready in a file listing at
  least all documentation files, one per line, like this:

    <CTAN package name>/<relative path>;<installation path>

  Let's call this file docinstall.log (or install.log in case you decide
  to record all file installations, not only documentation files).

  (this can also be in XML for more robustness, I don't really care)

  Then other distributors such as MiKTeX could provide such a file in
  order to have texcatalogue working for them.

  By matching the individual metadata files for each package with
  docinstall.log, I'd be able to find a given file for a given package
  in the installed system.
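
  As a sketch of that matching step, assuming the one-line-per-file
  format shown above, the lookup could be as simple as the following
  Python. The sample entries and installation paths are invented, and
  the code assumes that package names contain no "/" and that relative
  paths contain no ";".

```python
def parse_docinstall(lines):
    """Map (CTAN package name, relative path) -> installation path."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # <CTAN package name>/<relative path>;<installation path>
        source, installed_path = line.split(";", 1)
        package, relpath = source.split("/", 1)
        table[(package, relpath)] = installed_path
    return table


# Made-up sample entries; the installation paths are only illustrative.
SAMPLE_LOG = [
    "hyperref/doc/manual.pdf;/usr/share/texmf-dist/doc/latex/hyperref/manual.pdf",
    "pgf/doc/manual.pdf;/usr/share/texmf-dist/doc/generic/pgf/manual.pdf",
]

index = parse_docinstall(SAMPLE_LOG)
# Two files named manual.pdf, told apart by the package they belong to.
print(index[("hyperref", "doc/manual.pdf")])
print(index[("pgf", "doc/manual.pdf")])
```

  Because the key includes the package name, two files that are both
  called manual.pdf are never confused with each other.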

OK, so far I have postponed the explanation of the big picture. Let's
dive into it now.

Anyone having managed a reasonable bookmark, software, music, etc.
collection the "obvious way" has probably found that the classic simple
static hierarchical classification scheme is bound to fail for the
following reasons:

  - there are often documents (or more generally, objects of the
    collection) that would fit into several folders (categories).

  - some documents don't fit in any category, or you have to create
    ridiculously small categories that contain very few documents and
    clutter the category tree in an unacceptable way; or else you put
    them in a "misc" category, which quickly becomes a mess.

  - some categories appear as valid subcategories of several categories.
    Then, you're sure to miss one of them when browsing the category
    tree.

A simple example for software classification: I can have a structure
like this:

  Graphic work
  Sound
    -> Players
    -> Editors
    -> Recorders
  Mathematics

but this could also be structured like this:

  Command-line tools
    -> Sound
    -> Graphic Work
    -> Mathematics
  Console-based interactive tools (e.g., using ncurses or slang such
  as texconfig)
    -> Sound
   (-> Graphic Work)
    -> Mathematics
  GUI tools
    -> Sound
    -> Graphic Work
    -> Mathematics

If I'm looking for a sound player no matter the interface, the first
structure is better; if I'm specifically looking for a GUI tool for
Auntie, then the second is probably more convenient. Different needs,
different structures.

This problem is very real for Debian users trying to find what they
might want to install among the 21000+ packages we have in sid at the
moment. This led some clever people at Debian (mainly Enrico Zini,
AFAIK) to do some research on Library Science. They found that a nice
solution for such problems was devised in 1933 by Information
Architecture specialists. The basic concepts are relatively simple and
explained here:

  http://debtags.alioth.debian.org/paper-debtags.html

Debian's implementation of this scheme is called "debtags", whose home
page is here:

  http://debtags.alioth.debian.org/

OK, so how does it work?

Simply, to each package, you attach not a folder, but a set of tags
(e.g., "font", "mathematics", "hebrew", etc.). This way, you can look
for everything related to maths, or every font available in TL, or every
math font, or every font that provides Hebrew support, whatever.
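
In code, this boils down to plain set operations. Here is a minimal
Python sketch; the package names and their tags are invented for the
example and are not an actual classification.

```python
# Invented package names and tags, for illustration only.
COLLECTION = {
    "some-math-font": {"font", "mathematics"},
    "some-math-macros": {"mathematics"},
    "some-hebrew-font": {"font", "hebrew"},
}


def with_tags(collection, *required):
    """Return the names of all items that carry every required tag."""
    wanted = set(required)
    return sorted(name for name, tags in collection.items() if wanted <= tags)


print(with_tags(COLLECTION, "mathematics"))          # everything about maths
print(with_tags(COLLECTION, "font"))                 # every font
print(with_tags(COLLECTION, "font", "mathematics"))  # every math font
print(with_tags(COLLECTION, "font", "hebrew"))       # fonts with Hebrew support
```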

This is not science fiction. Currently, on my Debian system, I can type:

  debtags grep '(use::playing && ! use::recording) && works-with::audio && (interface::commandline || interface::text-mode)'

to find all packages that:
  - are used for playing media, but not for recording;
  - and work with audio media;
  - and have an interface that is either command-line driven, or in
    text-mode (ncurses, slang, etc., such as texconfig).
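
For those without a Debian system at hand, the logic of that expression
can be mimicked in a few lines of Python. This is only a sketch of the
boolean filter, not of how debtags itself is implemented, and the three
packages and their tags are made up.

```python
# Made-up packages and tags; only the filter logic matters here.
PACKAGES = {
    "player-cli": {"use::playing", "works-with::audio",
                   "interface::commandline"},
    "recorder-cli": {"use::playing", "use::recording", "works-with::audio",
                     "interface::commandline"},
    "player-gui": {"use::playing", "works-with::audio", "interface::x11"},
}


def matches(tags):
    """Same boolean structure as the debtags expression above."""
    return (
        ("use::playing" in tags and "use::recording" not in tags)
        and "works-with::audio" in tags
        and ("interface::commandline" in tags
             or "interface::text-mode" in tags)
    )


result = sorted(name for name, tags in PACKAGES.items() if matches(tags))
print(result)
```

Only "player-cli" passes: "recorder-cli" is excluded by the negation and
"player-gui" by the interface condition.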

\begin{parenthesis}
    You can easily obtain a nice list like that:

      Package: abcmidi-yaps
      Description: yet another ABC to PostScript converter
       This program translates tunes written in the ABC format to PostScript,
       which can then be viewed on screen or printed. It is essentially a
       (non-exclusive) alternative to abc2ps, being based on the abc2ps
       PostScript code together with the ABC parser from the abcmidi package.
       .
       People interested in this kind of software should also check out the
       abcm2ps package, which contains a similar program that has lots of
       additional features.

      Package: abcmidi
      Description: converter from ABC to MIDI format and back
       This package contains the programs `abc2midi' and `midi2abc',  which
       convert from the abc musical notation format to standard MIDI format
       and vice-versa. They can generate accompaniment from guitar chords
       in the abc file, as well as insert various MIDI events; the
       MIDI-to-abc translation tries to figure out bars, triplets and
       accidentals on its own.
       .
       The package also contains `abc2abc' (an abc prettyprinter/transposer),
       `mftext' (a program that dumps a MIDI file as text), and `midicopy'
       (a program that extracts specific tracks, channels or time intervals
       from a MIDI file).
       .
       The programs in this package are based on the `midifilelib'
       distribution available from http://www.harmony-central.com/MIDI/.

      [...]

    with the following command:

      debtags grep '(use::playing && ! use::recording) && works-with::audio && (interface::commandline || interface::text-mode)' \
              | xargs -n 1 grep-available -F Package -s Package,Description

    Note: it's a bit slow because it calls grep-available for every
          package; if you want to make it faster, you have to combine
          all package names output by the 'debtags' command into a
          single filter expression using the grep-available syntax, like
          this:

-------8<--------------------------------------------8<---------
      grep-available -s Package,Description \
        $(debtags grep '(use::playing && ! use::recording) && works-with::audio && (interface::commandline || interface::text-mode)' \
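          | cut -d: -f 1 | python -c '
import sys

def pkgexpr(pkg):
    # Exact-match (-X) filter on the Package field (-P) for one package
    return "-P %s -X" % pkg

# One package name per input line; split() also drops the trailing newline
pkglist = sys.stdin.read().split()

# Parenthesized print works with both Python 2 and Python 3
print(" --or ".join(map(pkgexpr, pkglist)))') | less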
-------8<--------------------------------------------8<---------

\end{parenthesis}

All this is explained in good detail here:

  http://debtags.alioth.debian.org/

and in particular in this paper:

  http://debtags.alioth.debian.org/paper-debtags.html

Looking at the previous expressions, you'll have noted the "::" stuff,
e.g. in 'use::playing' and 'interface::text-mode' (yes, it reminds me of
Perl, but I can live with it). What does it mean? Simply that the tags
are grouped into facets. For instance, looking at a given package, you
can ask yourself these questions:

  - What is it useful for?

    Possible answers:

      * browsing -> tag "use::browsing"
      * chatting -> tag "use::chatting"
      * printing -> tag "use::printing"

      etc.

  - What interface(s) does it present to the user?

    Possible answers:

      * a command-line interface -> tag "interface::commandline"
      * a text-based interactive interface
                                 -> tag "interface::text-mode"
      * a web interface          -> tag "interface::web"
      * an X11 interface         -> tag "interface::x11"

      etc.

These tags can be used to present a dynamic menu to the user. For
instance, he starts looking at the "use" facet (one menu entry). At this
point, he sees several entries: browsing, chatting, printing, etc. If he
chooses "chatting", he can now refine his search by choosing the
"interface" facet: this leads to a new submenu presenting the various
types of interfaces, etc.

OTOH, our user could have started by looking only at applications
offering an X11 (graphical) interface. The dynamic menu opened when
choosing the "interface::x11" tag would then let him choose among various
facets (or tags), one of which is "use". If he chooses "use" at this
point, he can narrow his search to web browsers by further choosing
"use::browsing" if he's looking for a browser. If he wants a particular
GUI toolkit (Qt, GTK, wxWidgets...), he can continue his exploration of
the dynamically-created tree with the "uitoolkit" facet, which groups
tags such as "uitoolkit::gtk", "uitoolkit::qt", "uitoolkit::wxwidgets",
etc.
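
Such a dynamic menu is easy to compute. The following Python sketch
shows one possible way, assuming tags of the form "facet::name"; the
package names and the submenu() helper are hypothetical and not part of
debtags or libtagcoll.

```python
from collections import defaultdict

# Made-up sample data, for illustration only.
PACKAGES = {
    "webbrowser-a": {"use::browsing", "interface::x11", "uitoolkit::gtk"},
    "webbrowser-b": {"use::browsing", "interface::x11", "uitoolkit::qt"},
    "textbrowser": {"use::browsing", "interface::text-mode"},
    "chatclient": {"use::chatting", "interface::x11", "uitoolkit::gtk"},
}


def submenu(packages, selected):
    """Return (matching package names, {facet: tags that narrow further}).

    Only tags carried by at least one still-matching package are offered,
    so the menu never leads to an empty result.
    """
    selected = set(selected)
    matching = {n: t for n, t in packages.items() if selected <= t}
    menu = defaultdict(set)
    for tags in matching.values():
        for tag in tags - selected:
            facet = tag.split("::", 1)[0]
            menu[facet].add(tag)
    return sorted(matching), {f: sorted(t) for f, t in menu.items()}


# Start from the X11 interface, then narrow down to browsers.
names, menu = submenu(PACKAGES, ["interface::x11"])
print(names, menu)
names2, menu2 = submenu(PACKAGES, ["interface::x11", "use::browsing"])
print(names2, menu2)
```

After the second step, the only facet left to refine with is
"uitoolkit", which is exactly the kind of tree that gets created on the
fly.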

I believe such a scheme would make it easier to find one's way among the
myriad of things we have in CTAN. The debtags project provides programs and
libraries for general use (tagcolledit, libtagcoll), i.e. for managing
arbitrary collections of objects, not only Debian packages. So, this
should not be very difficult to use for managing the collection of CTAN
packages (I believe I'd have to write a Python wrapper for libtagcoll,
but that's something I can do).

The end result, as I currently see it, would be a 'texcatalogue' program
(or some other name, I don't really care) that could be used to
conveniently browse and search through the collection of CTAN packages,
using this tag-based system.

Once a package is selected, it would be easy, thanks to the metadata I
mentioned at the beginning of this email, to present the relevant
documentation files to the user, in whatever languages he is willing to
read. I believe that would be a great replacement and enhancement for
texdoctk.
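
The language selection itself could look like the following Python
sketch. It assumes the per-package metadata has already been parsed into
a list of entries; the file names, field names and the docs_for_user()
helper are all invented for the example.

```python
# Hypothetical, already-parsed documentation entries for one package.
DOCS = [
    {"path": "doc/guide-de.pdf", "lang": "de", "description": "Guide (German)"},
    {"path": "doc/guide-en.pdf", "lang": "en", "description": "Guide (English)"},
    {"path": "doc/README", "lang": "en", "description": "README"},
]


def docs_for_user(docs, accepted_langs):
    """Keep docs in an accepted language, most preferred language first."""
    rank = {lang: i for i, lang in enumerate(accepted_langs)}
    wanted = [d for d in docs if d["lang"] in rank]
    # sorted() is stable, so files keep their original order per language.
    return sorted(wanted, key=lambda d: rank[d["lang"]])


# A user who prefers French, then English, and does not read German.
for doc in docs_for_user(DOCS, ["fr", "en"]):
    print(doc["path"], "-", doc["description"])
```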

There is a little issue I thought about: texdoctk has sections such as
"Guides and Tutorials" and "Fundamentals", which I find very useful and
don't want to lose. I may be wrong, but I don't think all documents in
these sections are part of CTAN packages. If this is true, the approach
described in this mail would require shipping these documentation files
in new CTAN packages. I don't think this is a serious problem.

We could tag such packages like this:

  Font Installation Guide

    -> type::document-guide
    -> subject::fonts
    -> subject::installation
    -> macropackage::latex

  A Gentle Introduction to TeX

    -> type::document-tutorial
    -> macropackage::plain-tex

  pdfTeX User Manual

    -> type::document-user-manual
    -> engine::pdftex

  KOMA-Script User's Guide (German)

    -> type::document-guide
    -> macropackage::latex
    -> classfamily::koma-script

etc.

(another obvious classfamily would be latex-standard)

Of course, I just made this up in a few minutes, but defining a good set
of facets and tags requires much more thought than that.

In case you wonder, debtags can be told that one tag implies other tags,
so that we could have type::document-tutorial, type::document-guide and
type::document-user-manual all imply type::document to easily look for
documentation, be it tutorials or reference manuals or whatnot. BTW, you
may ask yourself why I distinguished a "user manual" and a "guide": I'd
say a guide doesn't aim at being exhaustive, whereas a user manual
should, as far as it falls in the scope of user stuff.
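
To illustrate the idea of implied tags, here is a small Python sketch
that expands a set of tags with everything they imply. It is not how
debtags implements this; the IMPLIES table and the expand() function are
assumptions made for the example.

```python
# Hypothetical implication table: each tag maps to the tags it implies.
IMPLIES = {
    "type::document-tutorial": {"type::document"},
    "type::document-guide": {"type::document"},
    "type::document-user-manual": {"type::document"},
}


def expand(tags, implies=IMPLIES):
    """Return the tags plus everything they imply, directly or indirectly."""
    result = set(tags)
    pending = list(tags)
    while pending:
        tag = pending.pop()
        for implied in implies.get(tag, ()):
            if implied not in result:
                result.add(implied)
                pending.append(implied)
    return result


# "A Gentle Introduction to TeX", tagged as in the example above.
gentle = expand({"type::document-tutorial", "macropackage::plain-tex"})
print(sorted(gentle))
print("type::document" in gentle)
```

A search for type::document then finds tutorials, guides and user
manuals alike, without anyone having to add that tag by hand.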

OK, I think that's enough for now and will let people comment. I have
probably answered most other mails in this message, but I'll look again
at them individually tomorrow and reply to the questions that remain
unanswered.

Ah, I forgot one thing: if you want to play with the Debian tags
database, I suggest the following:

  1. [if you have access to a Debian system]
     the programs "packagesearch" and "debtags-edit", from the Debian
     packages with the same names. For those who cannot try them at
     home, there are some (old) screenshots here:

       http://debtags.alioth.debian.org/paper-debtags.html#debtags-edit

  2. http://debian.vitavonni.de/packagebrowser/

  3. http://debtags.alioth.debian.org/edit.html

    -> if you don't know which package names to enter, try debtags or
       lmodern;
    -> also try the "Tag cloud"; quite interesting IMHO
       (http://debtags.alioth.debian.org/cloud/, needs JavaScript).

Please note that the Debian tags databases (vocabulary and package tags)
are huge and therefore still under construction. Not every package is
tagged, not every package that is tagged is correctly and completely
tagged, and even the tags vocabulary is still under development (if you
look at /var/lib/debtags/vocabulary, you'll see that the status of some
tags is "needing-review", some others "draft", or even "controversial").
However, AFAIUI, updates to the tags database on
debtags.alioth.debian.org are regularly validated by competent people,
so the data found there should not be complete nonsense, even if
incomplete.

Well, that's it for now. Thanks for reading so far.

-- 
Florent