[tex-live] Better ways to find packages and documentation

Florent Rougon f.rougon at free.fr
Thu Jul 5 00:16:17 CEST 2007


Hi,

Norbert Preining <preining at logic.at> wrote:

>> Then make reality fit theory. :)
>
> This is what we assume! If problems occur we fix them.

Fine.

> The only problem is that I am genetically disabled for understanding
> Python ;-)

Tsssk tsssk, did you even *try*?

Remember the old saying:

                    ___________________________________
                   < Python is executable pseudo-code. >
                    -----------------------------------
                           \   ^__^
                            \  (oo)\_______
                               (__)\       )\/\
                                   ||----w |
                                   ||     ||

> Question: Is any file included in more than 1 TLPOBJ?
>
> Answer:
> 	current format:
> 		grep '^ ' texlive.tlpdb | sort | uniq --repeated

Beeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeep!

Nope, this only works if two packages have the exact same file *paths*.
The more useful thing would be to detect files with the same basename
(which is, mmmyes, easy, but *you* are trying to convince me, not the
other way 'round :).
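
A minimal Python sketch of that basename check, for comparison with the shell one-liner. It assumes that file lines start with a single space (as the grep above implies) and that each package record opens with a `name <package>` line. That header is an assumption made for illustration, and the sample records are made up.

```python
import posixpath
from collections import defaultdict

def shared_basenames(lines):
    """Return {basename: set of packages} for basenames found in 2+ packages."""
    owners = defaultdict(set)
    package = None
    for line in lines:
        if line.startswith("name "):
            # Assumed record header: "name <package>"
            parts = line.split(None, 1)
            package = parts[1].strip() if len(parts) == 2 else None
        elif line.startswith(" ") and package is not None:
            # File line: leading space, then the path, then optional attributes
            fields = line.split()
            if fields:
                owners[posixpath.basename(fields[0])].add(package)
    return {base: pkgs for base, pkgs in owners.items() if len(pkgs) > 1}

# Hypothetical sample records, not taken from a real texlive.tlpdb
sample = """\
name foo
runfiles size=1
 texmf-dist/tex/latex/foo/common.sty
 texmf-dist/tex/latex/foo/foo.sty
name bar
runfiles size=1
 texmf-dist/tex/latex/bar/common.sty
""".splitlines()

duplicates = shared_basenames(sample)
for base in sorted(duplicates):
    print(base, sorted(duplicates[base]))
```

Unlike `uniq --repeated` on full paths, this reports `common.sty` even though the two copies live in different directories.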

> 	xml format:
> 		shoot yourself ...

Well, well, well, it's just a few lines of Python code away. Not shell,
I admit.

> If you need more examples ...

OTOH, XML is eXtensible. *I*'ll give an example: when I started with my
little movie catalog in XML, I had no <comments> elements yet (for
storing my comments about a movie). When I found that useful, I was able
to add them in the DB and the Python script was still working with *no*
*modification* *at all*. It simply ignored these new elements. When I
later had a bit more time, I implemented handling of these fields and
added the corresponding markup in my LaTeX template files.

So, you can extend the file format with no modification whatsoever to
the program reading it, it still just works. When you have time, you can
then extend the program to take advantage of the new data. Same thing
with attributes (X-rated="yes" and such ;-) I hadn't thought of at
first.

You can often do similar things with custom formats, but it is not
always so automatic or elegant.
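
To make the point concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`. The element names (`catalog`, `movie`, `title`, `year`) and the sample data are hypothetical, not the actual movie catalog; the reader only asks for the elements it knows about, so the later additions (`comments`, the `X-rated` attribute) pass through unnoticed.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog: <comments> and X-rated were "added later"
CATALOG = """\
<catalog>
  <movie X-rated="no">
    <title>Example Movie</title>
    <year>1999</year>
    <comments>Added after the reader below was written.</comments>
  </movie>
</catalog>
"""

def read_titles(xml_text):
    """Return (title, year) pairs; unknown elements and attributes are ignored."""
    root = ET.fromstring(xml_text)
    return [(movie.findtext("title"), movie.findtext("year"))
            for movie in root.iter("movie")]

print(read_titles(CATALOG))  # [('Example Movie', '1999')]
```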

> Another advantage was that I could write a (extremely dump, but working)
> shell library to access stuff in the tlpdb quite fast. Now, how to do
> this in XML???

Well, you cannot do the parsing in shell, but if it's really a library,
you can write it in Python and then provide a Python script that exposes a
command-line interface to the library. That's what I did with PyXMMS.

Anyway, your shell library is only doable in full shell if the format is
very simple. Otherwise, it becomes a *real pain* to parse correctly. For
instance, you'll tell me how to parse this in pure shell:

 path/to/manual.pdf details="pdfTeX User Manual" language="en"

(don't tell me you'll do that with "eval", because then I could easily
log into your computer and do pretty nice things)
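
For contrast, a sketch of how such a line might be parsed in Python without evaluating anything. It uses the standard `shlex` module, which follows POSIX shell quoting rules; that this is the right quoting convention for the format is an assumption, and `parse_file_line` is a hypothetical helper name.

```python
import shlex

def parse_file_line(line):
    """Split ' path key="value with spaces" ...' into (path, attrs), no eval."""
    tokens = shlex.split(line)  # tokenizes only, never executes anything
    path, attrs = tokens[0], {}
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {token!r}")
        attrs[key] = value
    return path, attrs

line = ' path/to/manual.pdf details="pdfTeX User Manual" language="en"'
print(parse_file_line(line))
# ('path/to/manual.pdf', {'details': 'pdfTeX User Manual', 'language': 'en'})
```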

> The question is about *WHAT* do we win when using xml wrt some
> structured text. I see nothing.

Of course, you can write structured text that looks like XML. But if you
want to support arbitrary nesting of structures and fields containing
spaces and/or newlines, I guess you will most probably end up with
either XML or some worse non-standard format.

One nice thing with XML is that the structure in the file can be
directly mirrored to a structured object in
Python/Perl/whatever-decent-programming-language (shell isn't one).

It's usually just a matter of a "for" loop and a library call to fill
your data structure with what you have in the XML file...
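
A sketch of that "for loop plus library call", again with `xml.etree.ElementTree`. The `<documentation>` elements follow the Catalogue style; the enclosing `<entry>` element and its `id` attribute are assumptions added only to make the sample well-formed.

```python
import xml.etree.ElementTree as ET

# <entry id=...> is a hypothetical wrapper around Catalogue-style elements
XML = """\
<entry id='hyperref'>
  <documentation details='Manual, PDF version:' language='en'
                 href='ctan:/macros/latex/contrib/hyperref/doc/manual.pdf'/>
  <documentation details='Summary of options:' language='en'
                 href='ctan:/macros/latex/contrib/hyperref/doc/options.pdf'/>
</entry>
"""

root = ET.fromstring(XML)          # the library call
docs = []
for element in root.iter("documentation"):  # the "for" loop
    docs.append(dict(element.attrib))       # one plain dict per document

for doc in docs:
    print(doc["language"], doc["details"], doc["href"])
```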

>> > - tagging is done on a per package level, not per file level
>> 
>> OK.
>> 
>> Hum, well, does everyone agree? :)
>
> Irrelevant, it is only you and me. Since I will implement this on the
> TeX Live side, and the net win will be for everyone. But see below,
> again.

I was saying that because it was difficult for me to give up on the idea
of document-level tags, and we seemed to be at the point of no return
concerning this decision, so I was somehow hoping that someone else
would object to this regression in available features. :)

> Furthermore I propose that we could extend the format of the docfiles
> lines as follows:
> docfiles size=*****
>  file1 attrib1=value1 attrib2=value2 ...
>  file2 attrib1=value1 ...

This is more or less OK, but looks more and more like XML. :)

And you'll have to come even closer to XML, because you need to quote
the attribute values, since I need the "details" attributes from the
Catalogue in order to display a nice description of each document in the
UI:

  <documentation details='Manual, PDF version:'  language='en'
                 href='ctan:/macros/latex/contrib/hyperref/doc/manual.pdf'/>
  <documentation details='Summary of options:'  language='en'
                 href='ctan:/macros/latex/contrib/hyperref/doc/options.pdf'/>

So, it's not:

 file1 attrib1=value1 attrib2=value2 ...
 file2 attrib1=value1 ...

but rather:

 file1 attrib1="value1" attrib2="value2" ...
 file2 attrib1="value1" ...

and you'll have to make up yet another quoting scheme for the cases
where we need a double quote in an attribute value... See how you're
slowly reinventing XML? :)
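
For what it's worth, XML already defines that quoting scheme, and Python's standard library implements it. The sketch below quotes a few sample values (made up for illustration) with `xml.sax.saxutils.quoteattr` and checks that a standard parser gives the original string back.

```python
from xml.sax.saxutils import quoteattr
import xml.etree.ElementTree as ET

# Made-up attribute values: plain, with double quotes, with both quote kinds
values = [
    'pdfTeX User Manual',
    'The "short" guide',
    """It's the "short" guide""",
]

round_tripped = []
for value in values:
    # quoteattr picks the quote character and escapes what needs escaping
    element_text = f"<documentation details={quoteattr(value)}/>"
    parsed = ET.fromstring(element_text).get("details")
    round_tripped.append(parsed)
    print(element_text)

print(round_tripped == values)
```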

,----
| Theorem (F. Rougon, 2007)
| 
| Any custom text file format tends to become a degraded version of XML
| as adding features requires extending it.
`----

> That is an easy extension of the syntax, and we could carry over the
> Catalogue contained attributes of files to the texlive.tlpdb. 

Sure, but you'll have to explicitly carry over every needed attribute.
Since the Catalogue DTD isn't likely to change every day, this won't be
a problem in practice, though...

> Furthermore, we add (optionally) for every TLPOBJ a line
> 	tags <tag1> <tag2> <tag3> ...
> to get the per packages tagging.

OK (yes, I admit we don't really need spaces/newlines in tag names).

> Now the only problem is that we have to get this information from the
> Catalogue to the texlive.tlpdb and its generation time. This is a
> problem only for me I guess, but this I can handle.

Thanks.

> This way you don't have to have access to anything else but the
> texlive.tlpdb, local.tlpdb, local added .xml/whatever files for
> TEXMFLOCAL.

Perfect.

> Does this sound reasonable?

Yes.

> If you agree on that, we should start (in private email) to write a
> decent proposal with:
> - rational

That's spelt "rationale"...

> - format changes to the infra structure of TeX Live

I suppose this should be in the files describing the TL infra in SVN,
no? I mean, the CTAN maintainers don't need to approve changes in the TL
infra, only those to the Catalogue, do they?

> - changes necessary for the Catalogue
>   . DTD changes
>   . upload/handling changes

Well, for upload and things like that, we can propose several
possibilities as already mentioned, but there are policy decisions that
*the CTAN maintainers* have to make, such as whether to use override
files, whether to blindly trust metadata from package official
maintainers, how to authenticate them, etc. (yes, I don't think they'll
want to require PGP-signed uploads, so authentication is probably
impossible to achieve...)

> - specification of a file format for the upload specification

Really, we shouldn't make this up without their input. There are many
possibilities, with XML, RFC-2822, etc. Well, we can always propose
something, but chances are good it will end up in the trashcan, so...

> From my side as TL guy I see no problem in adding those tags/attributes.
> It will not blow up the tlpdb too much.

Surely, long descriptions will blow it up much more. But you *have* to
be able to cope with spaces in attribute values, in order to store the
short description for each doc file.

Question:

  Where will I find the various tlpdb files on the installed system?
  Currently, there is only one such file, but there are several TEXMF
  trees, so it is either in only one of them (ugh), or preferably
  split among them. Since you said that a given TL package cannot be
  split among several TEXMF trees, it shouldn't be difficult to put into
  each TEXMF tree the part of the monolithic tlpdb file that contains
  all packages from that tree. Is this what you intend to do?

  If so, then the texmf[-{dist,doc}]/ part should be stripped as part of
  the tlpdb split process, since it has no meaning within a TEXMF tree.
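
A rough sketch of the prefix stripping described above, assuming the only tree prefixes are `texmf/`, `texmf-dist/` and `texmf-doc/` (as the `texmf[-{dist,doc}]/` notation suggests). The function name and the sample paths are hypothetical.

```python
import re
from collections import defaultdict

# Matches texmf/, texmf-dist/ or texmf-doc/ at the start of a path
TREE_PREFIX = re.compile(r"^(texmf(?:-dist|-doc)?)/(.+)$")

def split_by_tree(paths):
    """Group paths by TEXMF tree, storing each path relative to its tree."""
    per_tree = defaultdict(list)
    for path in paths:
        match = TREE_PREFIX.match(path)
        if match is None:
            raise ValueError(f"no known TEXMF tree prefix in {path!r}")
        tree, relative = match.groups()
        per_tree[tree].append(relative)
    return dict(per_tree)

# Hypothetical sample paths
paths = [
    "texmf-dist/tex/latex/foo/foo.sty",
    "texmf-doc/doc/english/foo/foo.pdf",
    "texmf/web2c/texmf.cnf",
]
for tree, files in split_by_tree(paths).items():
    print(tree, files)
```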

Also, I don't know much about how you make the Debian packages, but will you
be able to easily adapt all this for Debian, since we don't have the
same TEXMF trees as in upstream TL? (well, if the file paths relative to
the base of the TEXMF trees don't change, it's probably trivial since I
guess that basically everything goes to /usr/share/texmf-texlive, but
better think about that now than too late).

> For the CTAN it actually depends on the changes, but AFAIS now we only
> need one more XML whatever entity for the tags.
                    ^^^^^^^^^^^^^^^
                        element

An entity is something like "&foobar;".

(yes, I know you may be doing that intentionally to make XML look
complicated :)

(and yes, I do think XML is complicated *if* you want to understand
the myriad of extensions around it such as XPath, XLink, XML Schemas,
RELAX-NG, XWhatever, but basic XML is simple)

With your answers to this mail, I should be able to start working (which
doesn't mean you'll see immediate results, because I'll have to learn Qt
again and see how to work with libtagcoll, but being able to assemble
all the parts in my little head will greatly improve my peace of mind
:).

Regards,

-- 
Florent

