[tex-live] [LONG] Improving TeX package classification and the associated documentation
f.rougon at free.fr
Mon Jul 2 00:12:24 CEST 2007
karl at freefriends.org (Karl Berry) wrote:
> I see no hugely critical reason to support compressed pdf files. I
> don't think we should have them in TeX Live. The amount of space saved
> is not worth the ensuing trouble IMHO.
Yes, I think there is now a fair consensus that supporting externally
compressed PDF files isn't worth the trouble. It seems we're better off
waiting for PDF 1.5 to spread and maybe improving pdfTeX and other tools
for more efficient font embedding.
> If you want to rewrite texdoctk to somehow make use of this, perhaps
> using the Catalogue data as processed for TL by David's new Elisp (which
> I still haven't tried yet, sadly), go for it. I certainly agree that
> texdoctk.dat is not a good solution.
David's script is a simple way to have a usable offline catalogue *now*,
but for a good rewrite of texdoctk, I'd like something less heuristic,
that deals correctly in a deterministic way with files named manual.pdf,
be they from hyperref or some other package.
This does require integration with the TL installation procedure, but
AFAIU, this is the right moment and I have no problem working with
My current view of the Right Way would be the following:
Each package (in the sense of CTAN package, not Debian) contains an
XML file that specifies the following:
- what is the package useful for; I think the debtags approach would
be great for that, see below.
- what documentation files are provided in the package, with at least:
* relative paths from the root of the package
* the language each file is written in
* a short description of each file (in most cases, the document
title should be OK).
One might want to tag the individual doc files with the debtags
approach, but I don't think we need so much granularity. I think
it is sufficient to put the tags at the CTAN package level.
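To make this more concrete, here is a rough sketch of what such a per-package metadata file could look like and how a program might read it. The package name, the tags, and the element and attribute names (`package`, `tag`, `doc`, `path`, `lang`) are all invented for illustration; none of this is an existing CTAN or TeX Live format.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata for one CTAN package. Every name in this
# document (elements, attributes, tags, paths) is an assumption.
SAMPLE = """\
<package name="examplepkg">
  <tag>use::typesetting</tag>
  <tag>works-with::pdf</tag>
  <doc path="doc/manual.pdf" lang="en">User manual</doc>
  <doc path="doc/manual-de.pdf" lang="de">Benutzerhandbuch</doc>
</package>
"""


def read_metadata(xml_text):
    """Return (package name, set of tags, list of documentation entries)."""
    root = ET.fromstring(xml_text)
    tags = {t.text.strip() for t in root.iter("tag")}
    docs = [
        {
            "path": d.get("path"),          # relative to the package root
            "lang": d.get("lang"),          # language the file is written in
            "title": (d.text or "").strip() # short description
        }
        for d in root.iter("doc")
    ]
    return root.get("name"), tags, docs


name, tags, docs = read_metadata(SAMPLE)
for d in docs:
    print(name, d["lang"], d["path"], d["title"])
```

The tags sit at the package level only, as suggested above; tagging individual files would just mean allowing `tag` children under `doc`.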
Then, I need some information regarding where the files are installed
in TL. I suppose the TL installation infrastructure could easily
provide me with this.
A simple way to do that would be to have my program (let's call it
texcatalogue for now) find this information ready in a file listing at
least all documentation files, one per line like that:
<CTAN package name>/<relative path>;<installation path>
Let's call this file docinstall.log (or install.log in case you decide
to record all file installations, not only documentation files).
(this can also be in XML for more robustness, I don't really care)
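A minimal sketch of reading such a file, assuming the one-line-per-file format shown above, that package names contain no "/" and that ";" does not occur in the source part. The sample paths and package names are made up.

```python
def parse_docinstall(lines):
    """Map (CTAN package name, relative path) -> installation path."""
    table = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        source, sep, installed = line.partition(";")
        if not sep:
            raise ValueError("missing ';' in line: %r" % line)
        package, slash, relpath = source.partition("/")
        if not slash:
            raise ValueError("missing '/' in source part: %r" % source)
        table[(package, relpath)] = installed
    return table


# Hypothetical content of docinstall.log: two packages that both ship
# a file called manual.pdf end up as two distinct keys.
SAMPLE_LOG = """\
examplepkg/doc/manual.pdf;/opt/texlive/texmf-dist/doc/latex/examplepkg/manual.pdf
otherpkg/manual.pdf;/opt/texlive/texmf-dist/doc/latex/otherpkg/manual.pdf
"""

table = parse_docinstall(SAMPLE_LOG.splitlines())
for (package, relpath), installed in sorted(table.items()):
    print(package, relpath, "->", installed)
```

Keying on the pair (package, relative path) is what removes the need for heuristics on file names.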
Then other distributors such as MiKTeX could provide such a file in
order to have texcatalogue working for them.
By matching the individual metadata files for each package with
docinstall.log, I'd be able to find a given file for a given package
in the installed system.
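The matching step could look roughly like this. The data structures are assumptions: `docs` stands for the documentation entries read from one package's metadata file, and `install_table` for the already-parsed content of docinstall.log, keyed by (package, relative path). All names and paths are invented.

```python
def installed_docs(package, docs, install_table, languages=None):
    """Return (title, installed path) for the package's installed doc files.

    If `languages` is given, only files written in one of these
    languages are returned.
    """
    found = []
    for doc in docs:
        if languages is not None and doc["lang"] not in languages:
            continue
        installed = install_table.get((package, doc["path"]))
        if installed is not None:  # file listed in metadata but not installed
            found.append((doc["title"], installed))
    return found


# Hypothetical inputs for one package.
docs = [
    {"path": "doc/manual.pdf", "lang": "en", "title": "User manual"},
    {"path": "doc/manual-de.pdf", "lang": "de", "title": "Benutzerhandbuch"},
    {"path": "doc/notes.pdf", "lang": "en", "title": "Release notes"},
]
install_table = {
    ("examplepkg", "doc/manual.pdf"): "/opt/texlive/doc/examplepkg/manual.pdf",
    ("examplepkg", "doc/manual-de.pdf"): "/opt/texlive/doc/examplepkg/manual-de.pdf",
}

english = installed_docs("examplepkg", docs, install_table, languages={"en"})
print(english)
```

Here `doc/notes.pdf` is silently skipped because the distribution did not install it; a real tool might want to report that instead.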
OK, so far I have postponed the explanation of the big picture. Let's dive
into it now.
Anyone having managed a reasonable bookmark, software, music, etc.
collection the "obvious way" has probably found that the classic simple
static hierarchical classification scheme is bound to fail for the
following reasons:
- there are often documents (or more generally, objects of the
collection) that would fit into several folders (categories).
- some documents don't fit in any category, or you have to create
ridiculously small categories that contain very few documents and
clutter the category tree in an unacceptable way; or else you put
them in a "misc" category, which quickly becomes a mess.
- some categories appear as valid subcategories of several categories.
Then, you're sure to miss one of them when browsing the category tree.
Simple example for software classification: I can have a structure like
but this could also be structured like that:
-> Graphic Work
Console-based interactive tools (e.g., using ncurses or slang such
(-> Graphic Work)
-> Graphic Work
If I'm looking for a sound player no matter the interface, the first
structure is better; if I'm specifically looking for a GUI tool for
Auntie, then the second is probably more convenient. Different needs,
This problem is very real for Debian users trying to find what they might
want to install among the 21000+ packages we have in sid at the moment.
This led some clever people at Debian (mainly Enrico Zini, AFAIK) to do
some research on Library Science. They found that a nice solution for such
problems was devised in 1933 by Information Architecture specialists.
The basic concepts are relatively simple and explained here:
Debian's implementation of this scheme is called "debtags", whose home
page is here:
OK, so how does it work?
Simply, to each package, you attach not a folder, but a set of tags
(e.g., "font", "mathematics", "hebrew", etc.). This way, you can look
for everything related to maths, or every font available in TL, or every
math font, or every font that provides hebrew support, whatever.
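In terms of data structures this is just sets and intersections. A small sketch, with made-up package names and the example tags from the paragraph above (this is not how debtags stores its data, only an illustration of the idea):

```python
from collections import defaultdict

# Hypothetical packages, each with a set of tags instead of one folder.
PACKAGE_TAGS = {
    "examplefont-a": {"font", "mathematics"},
    "examplefont-b": {"font", "hebrew"},
    "examplemath": {"mathematics"},
}


def build_index(package_tags):
    """Invert the mapping: tag -> set of packages carrying that tag."""
    index = defaultdict(set)
    for pkg, tags in package_tags.items():
        for tag in tags:
            index[tag].add(pkg)
    return index


def having_all(index, *tags):
    """Packages that carry every one of the given tags."""
    sets = [index.get(tag, set()) for tag in tags]
    return set.intersection(*sets) if sets else set()


index = build_index(PACKAGE_TAGS)
print(sorted(having_all(index, "font")))                 # every font
print(sorted(having_all(index, "font", "mathematics")))  # every math font
print(sorted(having_all(index, "font", "hebrew")))       # fonts with hebrew support
```

A package shows up under every tag it carries, so nothing has to be forced into a single category.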
This is not science fiction. Currently, on my Debian system, I can type:
debtags grep '(use::playing && ! use::recording) && works-with::audio && (interface::commandline || interface::text-mode)'
to find all packages that:
- are used for playing media, but not for recording;
- and work with audio media;
- and have an interface that is either command-line driven, or in
text-mode (ncurses, slang, etc., such as texconfig).
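Read as a condition on one package's set of tags, the expression above amounts to the following predicate. This only illustrates the meaning of the expression; the package names and their tags are invented, and it says nothing about how debtags evaluates it internally.

```python
def wanted(tags):
    """The boolean tag expression above, applied to one package's tag set."""
    return (
        ("use::playing" in tags and "use::recording" not in tags)
        and "works-with::audio" in tags
        and ("interface::commandline" in tags or "interface::text-mode" in tags)
    )


# Hypothetical packages and tags.
PACKAGES = {
    "cli-player": {"use::playing", "works-with::audio", "interface::commandline"},
    "gui-player": {"use::playing", "works-with::audio", "interface::x11"},
    "cli-recorder": {
        "use::playing", "use::recording",
        "works-with::audio", "interface::text-mode",
    },
}

matches = sorted(name for name, tags in PACKAGES.items() if wanted(tags))
print(matches)
```

With these made-up tags, only `cli-player` satisfies all three conditions: `gui-player` fails on the interface and `cli-recorder` on the "not for recording" part.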
You can easily obtain a nice list like that:
Description: yet another ABC to PostScript converter
This program translates tunes written in the ABC format to PostScript,
which can then be viewed on screen or printed. It is essentially a
(non-exclusive) alternative to abc2ps, being based on the abc2ps
PostScript code together with the ABC parser from the abcmidi package.
People interested in this kind of software should also check out the
abcm2ps package, which contains a similar program that has lots of
Description: converter from ABC to MIDI format and back
This package contains the programs `abc2midi' and `midi2abc', which
convert from the abc musical notation format to standard MIDI format
and vice-versa. They can generate accompaniment from guitar chords
in the abc file, as well as insert various MIDI events; the
MIDI-to-abc translation tries to figure out bars, triplets and
accidentals on its own.
The package also contains `abc2abc' (an abc prettyprinter/transposer),
`mftext' (a program that dumps a MIDI file as text), and `midicopy'
(a program that extracts specific tracks, channels or time intervals
from a MIDI file).
The programs in this package are based on the `midifilelib'
distribution available from http://www.harmony-central.com/MIDI/.
with the following command:
debtags grep '(use::playing && ! use::recording) && works-with::audio && (interface::commandline || interface::text-mode)' \
| xargs -n 1 grep-available -F Package -s Package,Description
Note: it's a bit slow because it calls grep-available for every
package; if you want to make it faster, you have to combine
all package names output by the 'debtags' command into a
single filter expression using the grep-available syntax, like
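grep-available -s Package,Description \
  $(debtags grep '(use::playing && ! use::recording) && works-with::audio && (interface::commandline || interface::text-mode)' \
    | cut -d: -f 1 | python -c '
import sys
# Turn one package name into an exact-match grep-available filter.
def pkgexpr(pkg):
    return "-P %s -X" % pkg
# One package name per line on stdin; drop the trailing newline.
pkglist = sys.stdin.read()[:-1].split("\n")
print(" --or ".join(map(pkgexpr, pkglist)))') | less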
All this is explained in good detail here:
and in particular in this paper:
Looking at the previous expressions, you'll have noted the "::" stuff,
e.g. in 'use::playing' and 'interface::text-mode' (yes, it reminds me of
Perl, but I can live with it). What does it mean? Simply that the tags
are grouped into facets. For instance, looking at a given package, you
can ask yourself these questions:
- What is it useful for?
* browsing -> tag "use::browsing"
* chatting -> tag "use::chatting"
* printing -> tag "use::printing"
- What interface(s) does it present to the user?
* a command-line interface -> tag "interface::commandline"
* a text-based interactive interface
-> tag "interface::text-mode"
* a web interface -> tag "interface::web"
* an X11 interface -> tag "interface::x11"
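Since a tag is written as "facet::name", grouping the tags of a package by facet is a simple string split. A small sketch using the tag names listed above:

```python
from collections import defaultdict


def by_facet(tags):
    """Group tags of the form 'facet::name' into {facet: {name, ...}}."""
    facets = defaultdict(set)
    for tag in tags:
        facet, sep, name = tag.partition("::")
        if sep:  # skip anything that is not of the form facet::name
            facets[facet].add(name)
    return dict(facets)


grouped = by_facet({"use::browsing", "interface::x11", "interface::web"})
for facet in sorted(grouped):
    print(facet, sorted(grouped[facet]))
```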
These tags can be used to present a dynamic menu to the user. For
instance, he starts looking at the "use" facet (one menu entry). At this
point, he sees several entries: browsing, chatting, printing, etc. If he
chooses "chatting", he can now refine his search by choosing the
"interface" facet: this leads to a new submenu presenting the various
types of interfaces, etc.
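Such a dynamic menu can be computed from the tag sets alone: keep the packages that carry every tag selected so far, then offer the tags that still occur among them, grouped by facet. A rough sketch with invented package names (the counts are an extra I added; they are not something the scheme requires):

```python
from collections import Counter

# Hypothetical packages and tags.
PACKAGES = {
    "chat-gui": {"use::chatting", "interface::x11"},
    "chat-cli": {"use::chatting", "interface::commandline"},
    "browser-gui": {"use::browsing", "interface::x11"},
}


def refine(packages, selected):
    """Return (matching package names, next menu) for a set of selected tags.

    The menu maps each facet to {tag: number of matching packages}.
    """
    matching = {p: t for p, t in packages.items() if selected <= t}
    counts = Counter(
        tag for tags in matching.values() for tag in tags - selected
    )
    menu = {}
    for tag, n in counts.items():
        facet = tag.split("::", 1)[0]
        menu.setdefault(facet, {})[tag] = n
    return sorted(matching), menu


# Starting from the "use" facet...
names, menu = refine(PACKAGES, {"use::chatting"})
print(names, menu)
# ...or from the interface instead; the tree is built on the fly.
names_x11, menu_x11 = refine(PACKAGES, {"interface::x11"})
print(names_x11, menu_x11)
```

The same data yields a different "tree" depending on where the user starts, which is the whole point compared to a fixed hierarchy.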
OTOH, our user could have started by looking only at applications
offering an X11 (graphical) interface. The dynamic menu opened when
choosing the "interface::x11" tag would then offer to choose among various
facets (or tags), one of which is "use". If he chooses "use" at this
point, he can narrow his search to web browsers by further choosing
"use::browsing" if he's looking for a browser. If he wants a particular
GUI toolkit (Qt, GTK, wxWidgets...), he can continue his exploration of
the dynamically-created tree with the "uitoolkit" facet, which groups
tags such as "uitoolkit::gtk", "uitoolkit::qt", "uitoolkit::wxwidgets",
I believe such a scheme would make it much easier to find one's way among
the myriad things we have on CTAN. The debtags project provides programs and
libraries for general use (tagcolledit, libtagcoll), i.e. for managing
arbitrary collections of objects, not only Debian packages. So, this
should not be very difficult to use for managing the collection of CTAN
packages (I believe I'd have to write a Python wrapper for libtagcoll,
but that's something I can do).
The end result, as I currently see it, would be a 'texcatalogue' program
(or some other name, I don't really care) that could be used to
conveniently browse and search through the collection of CTAN packages,
using this tag-based system.
Once a package is selected, it would be easy, thanks to the metadata I
mentioned at the beginning of this email, to present the relevant
documentation files to the user, in whichever languages he is willing to
read. I believe that would be a great replacement and enhancement for
texdoctk.
There is a little issue I thought about: texdoctk has sections such as
"Guides and Tutorials" and "Fundamentals", which I find very useful and
don't want to lose. I may be wrong, but I don't think all documents in
these sections are part of CTAN packages. If this is true, the approach
described in this mail would require shipping these documentation files
in new CTAN packages. I don't think this is a serious problem.
We could tag such packages like that:
Font Installation Guide
A Gentle Introduction to TeX
pdfTeX User Manual
KOMA-Script User's Guide (German)
(another obvious classfamily would be latex-standard)
Of course, I just made this up in a few minutes, but defining a good set
of facets and tags requires much more thought than that.
In case you wonder, debtags can be told that one tag implies other tags,
so that we could have type::document-tutorial, type::document-guide and
type::document-user-manual all imply type::document to easily look for
documentation, be it tutorials or reference manuals or whatnot. BTW, you
may ask yourself why I distinguished a "user manual" and a "guide": I'd
say a guide doesn't aim at being exhaustive, whereas a user manual
should, as far as it falls in the scope of user stuff.
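The effect of such implications can be sketched as follows: expand each package's tags with everything they imply, then search as usual. The `IMPLIES` table and the `expand` helper are my own illustration, not the mechanism or file format debtags actually uses.

```python
# Hypothetical implication table: tag -> tags it implies.
IMPLIES = {
    "type::document-tutorial": {"type::document"},
    "type::document-guide": {"type::document"},
    "type::document-user-manual": {"type::document"},
}


def expand(tags, implies):
    """Return `tags` plus every tag implied by them, directly or indirectly."""
    result = set(tags)
    pending = list(tags)
    while pending:
        tag = pending.pop()
        for implied in implies.get(tag, ()):
            if implied not in result:  # also guards against cycles
                result.add(implied)
                pending.append(implied)
    return result


expanded = expand({"type::document-guide", "interface::x11"}, IMPLIES)
print(sorted(expanded))
```

After expansion, a search for type::document finds tutorials, guides and user manuals alike, without anyone having to add that tag by hand.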
OK, I think that's enough for now and will let people comment. I have
probably answered most other mails in this message, but I'll look again
at them individually tomorrow and reply to the questions that remain
Ah, I forgot one thing: if you want to play with the Debian tags
database, I suggest the following:
1. [if you have access to a Debian system]
the programs "packagesearch" and "debtags-edit", from the Debian
packages with the same names. For those who cannot try them at
home, there are some (old) screenshots here:
-> if you don't know which package names to enter, try debtags or
-> also try the "Tag cloud"; quite interesting IMHO
Please note that the Debian tags databases (vocabulary and package tags)
are huge and therefore still under construction. Not every package is
tagged, not every package that is tagged is correctly and completely
tagged, and even the tags vocabulary is still under development (if you
look at /var/lib/debtags/vocabulary, you'll see that the status of some
tags is "needing-review", some others "draft", or even "controversial").
However, AFAIUI, updates to the tags database on
debtags.alioth.debian.org are regularly validated by competent people,
so the data found there should not be complete nonsense, even if
Well, that's it now. Thanks for reading so far.