Idea: Git as basis for future CTAN and TeX Live. (Discuss here or at tomorrow's TeX Hour)

Henri Menke henri at henrimenke.de
Sun Jun 27 21:02:13 CEST 2021


On Sun, 2021-06-27 at 19:58 +0200, Patrick wrote:
> (Sorry for the double post, Jonathan)
> 
> I used to mirror CTAN with a git repository (a commit of the current
> status every day). It grew so big that it became completely
> unmaintainable.
> Git was not suitable for that. I have not tried git large file
> storage, but I doubt that it would have helped me. My goal was to
> create a real archive, which CTAN, despite its name, is not.

Git LFS is not a good solution in my experience. There are a couple of
issues with the protocol; for example, it only runs over HTTP, not SSH.

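The first thing to try would be to turn off Git's delta compression,
which quickly becomes very problematic for incompressible binary files
(of which there are a lot on CTAN). Setting core.bigFileThreshold to 1
makes Git store every file larger than one byte deflated, without
attempting a delta: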

    git config core.bigFileThreshold 1
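
A minimal sketch of how that could look in a scratch repository. The
temporary directory, the demo identity and the file patterns are
assumptions for illustration, not anything CTAN uses. Besides the
threshold, the delta attribute in .gitattributes can switch deltas off
per file type, which may be a gentler option:

```shell
set -e
repo=$(mktemp -d)                  # hypothetical scratch location
cd "$repo"
git init -q .

# Option 1: treat every file larger than 1 byte as "big", so Git
# stores it deflated and never attempts a delta against other objects.
git config core.bigFileThreshold 1

# Option 2: disable deltas only for selected (assumed) binary types.
printf '%s\n' '*.pdf -delta' '*.zip -delta' > .gitattributes
git add .gitattributes
git -c user.name=demo -c user.email=demo@example.org \
    commit -q -m 'Disable delta compression for binary files'

# Inspect the resulting settings.
git config --get core.bigFileThreshold
git check-attr delta -- manual.pdf
```

The last command only reports how Git would treat a path, so manual.pdf
(a hypothetical file name) does not have to exist.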

Another alternative would be to not track the file contents in Git at
all, but instead to track only the metadata and store the bare files on
disk unchanged. There is a tool for that called git-annex.

    https://git-annex.branchable.com/
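
Roughly, and only as a sketch under assumptions (the scratch directory,
the demo identity, the repository description and the file big.bin are
all made up for illustration), a git-annex workflow could look like
this:

```shell
set -e
# git-annex is a separate program; stop quietly if it is not installed.
command -v git-annex >/dev/null 2>&1 || { echo "git-annex not found"; exit 0; }

repo=$(mktemp -d)                  # hypothetical scratch location
cd "$repo"
git init -q .
git config user.name  "demo"
git config user.email "demo@example.org"
git annex init "ctan-mirror-demo"  # the description is an arbitrary label

# Stand-in for a large, incompressible file (1 MiB of random bytes).
head -c 1048576 /dev/urandom > big.bin

# Git records only a small pointer (normally a symlink); the content
# itself is kept unchanged under .git/annex/objects.
git annex add big.bin
git commit -q -m "Track big.bin with git-annex"

# List annexed files whose content is present in this repository.
git annex find
```

The history then stays small because each commit only carries the
pointers, while the file contents are managed outside Git's object
store.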

Finally, the last option could be to just use SVN, which is what TeX
Live does. People tend to prefer Git over SVN because SVN is pretty bad
at branching and has to copy the entire worktree for each branch. On
the other hand, it is much better with large files, and if you are only
tracking CTAN, there will probably be a linear history with a single
branch anyway.

Cheers, Henri

> 
> That said, I think it would not take much change to CTAN to make it
> more suitable for distributing as a Git repository.
> 
> Patrick
> 
> On Wed, 23 June 2021 at 18:58, Jonathan Fine <
> jfine2358 at gmail.com> wrote:
> > 
> > Hi
> > 
> > As well as being a version control system, Git is a distributed
> > peer-to-peer content addressable store. It's also efficient in its
> > use of network bandwidth and mass storage. And it uses multiple
> > cores when possible, so it's also quick. And it is, of course,
> > widely used.
> > 
> > All this makes git a good foundation for rethinking CTAN and TeX
> > Live. This post explores this idea. We focus on git's use of
> > PACKFILES to do peer-to-peer file sharing.
> > 
> > When you clone a repository, the repository being cloned creates a
> > single git pack file (and associated index file perhaps), which is
> > then sent to the newly created local repository. From this, if
> > required, the working files are created.
> > 
> > If you do a pull from a source, the same process takes place, except
> > that the two repositories first do some negotiation to determine
> > what should be sent. And then as before a pack file is sent. And a
> > push is similar. (Actually, in both cases, it might be several pack
> > files.) Rsync, used by CTAN, also does peer-to-peer negotiation.
> > 
> > Here's an example of a git clone:
> > 
> > $ git clone git@github.com:jgm/pandoc.git
> > Cloning into 'pandoc'...
> > [snip]
> > 
> > $ ls -l pandoc/.git/objects/pack/
> > total 53480
> > -r--r--r-- 1 jfine jfine  2.8M Jun 23 17:12 pack-53640....idx
> > -r--r--r-- 1 jfine jfine 50M Jun 23 17:12 pack-53640....pack
> > 
> > And now I've got every version of every file in the history of
> > pandoc (up to the commit I cloned). That's not bad for 50M. (The
> > index can be computed from the pack. It speeds up disc access.)
> > 
> > For GitLab the size limit is 10GB per repository. For GitHub the
> > size limit is about 5GB. Norbert Preining's git-svn mirror of TeX
> > Live is about 40GB.
> > 
> > https://about.gitlab.com/blog/2015/04/08/gitlab-dot-com-storage-limit-raised-to-10gb-per-repo/
> > https://github.community/t/working-with-large-files-and-repositories/10203
> > https://texlive.info/
> > 
> > Let me end with a question. It's related to hosting TeX Live on
> > GitHub and GitLab.
> > 
> > First, consider all files in any version of TeX Live that are used
> > by any subscriber to this list as inputs to TeX or any of its
> > associated programs. (This definition is crafted to exclude
> > documentation files. And files not in TeX Live. It's the files in
> > TeX Live that TeX or whatever inputs when typesetting.)
> > 
> > Now for the question. Put all these files in a git pack file. How
> > big will that pack file be? Perhaps powers of 2 is the way to ask
> > this. In other words, at most 250M? At most 500M? At most 1G? At
> > most 2G? At most 4G? At most 8G? At most 16G? At most 32G? At most
> > 64G? [Stop here because Norbert's git-svn mirror provides 40G as a
> > bound.]
> > 
> > If we're at most 5GB then we can use both GitHub and GitLab to host
> > these files. And the TeX Collection / TeX Live could store this
> > material as git pack files. This would make the DVD a 
> > https://en.wikipedia.org/wiki/Sneakernet for some TeX-related git
> > repositories.
> > 
> > Still here? Well done. I'll be discussing this, read-only file
> > systems, immutable OSes and related methods at tomorrow evening's
> > TeX Hour.
> > 
> > When and where. Thursday 17 June, 6.30 to 7.30pm UK time. The UK
> > time now is at https://time.is/UK. The zoom details are
> > https://us02web.zoom.us/j/78551255396?pwd=cHdJN0pTTXRlRCtSd1lCTHpuWmNIUT09
> > Meeting ID: 785 5125 5396
> > Passcode: knuth
> > 
> > For the keen: READ-ONLY FILE SYSTEMS
> > https://en.wikipedia.org/wiki/Zero_Install
> > https://en.wikipedia.org/wiki/Snap_(package_manager)
> > https://archive.fosdem.org/2017/schedule/event/desktops_bundling_kde/
> > https://en.wikipedia.org/wiki/Flatpak
> > 
> > For the very keen: IMMUTABLE OSes
> > https://www.theregister.com/2021/06/16/systemd_249_release_candidate/
> > https://www.theregister.com/2021/04/01/systemd_248/
> > https://www.theregister.com/2021/02/18/kinoite_immutable_fedora/
> > 
> > Finally, video from last week's TeX Hour is available at
> > https://www.youtube.com/playlist?list=PLw1FZfIX1w7hwBDqZoii3eOtd-RMivznf
> > 
> > --
> > Jonathan
> > 