GAMS: svn+trac to GitLab

(Stefan Vigerske, 29th August 2018)

GAMS has moved one of its main repositories and bugtracking systems from Subversion+Trac to GitLab. This document describes some of the technicalities, e.g., tools, commands, and workarounds, and makes the Trac to GitLab conversion script available. Maybe it will be helpful for someone.

Starting Point

In 2007, GAMS moved from having no version control to using Subversion. A repository called "source" was created. In it, a directory "products" was created to hold the complete source code and libraries of solver vendors for the GAMS distribution. The products directory contains a typical trunk/, branches/, tags/ directory structure. For bugtracking, Trac was set up and 1625 bug reports from an older handmade bugtracking system were imported into Trac.

The repository and bugtracking system have grown modestly over the last 11 years. The "source" repository, of which the "products" subdirectory makes up a major part, holds 67720 commits and has a size of 33GB. The Trac system has grown to 3500 tickets and a size of 1.3GB.

Conversion of repository

We wanted to convert the "products" subdirectory of the Subversion repository "source" to a single Git repository, thereby preserving the complete history (maybe with some minor cleanup).

Since GAMS has regularly stored binary files in the Subversion repository, especially precompiled libraries of solver vendors, and will have to continue to do so, a pure Git repository would have become too large for users to clone easily. We therefore use Git-LFS to store most binary files, see also the Git-LFS documentation, Git-LFS in GitLab, and this Git-LFS tutorial.

svn2git

We used svn2git, which works on top of "git svn", to convert the complete products subdirectory of the "source" repo to a normal Git repository:

svn2git file://<path-to-source-repo>/products \
     --no-minimize-url \
     --authors authors.txt \
     --exclude '.*/tmp/' \
     --metadata \
     --verbose

The authors.txt file needs to contain a mapping from usernames in the Subversion repository to names with e-mail addresses as they will then appear in Git commits. The --exclude '.*/tmp/' flag removes the content of every "tmp" directory in any commit. The --metadata flag ensures that for every Git commit, a reference to the svn commit will be included in the commit message in the form of a git-svn-id.
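A skeleton for authors.txt could be bootstrapped from the Subversion log. The following is only a minimal sketch: the function name make_authors_skeleton and the example.com e-mail domain are hypothetical placeholders, and real names and addresses still have to be filled in by hand. Each output line uses the "username = Name <e-mail>" form that git svn expects in an authors file.

```shell
# Read `svn log --quiet` output on stdin and print one authors-file line
# per distinct committer. Name and e-mail are placeholders (assumption).
#
# Possible usage:
#   svn log --quiet file://<path-to-source-repo>/products \
#     | make_authors_skeleton > authors.txt
make_authors_skeleton() {
    awk -F' [|] ' '/^r[0-9]+ [|] / { print $2 }' \
        | sort -u \
        | while IFS= read -r user; do
              printf '%s = %s <%s@example.com>\n' "$user" "$user" "$user"
          done
}
```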

This run of svn2git took about 15.5 hours.

Before running svn2git, we disabled the final call to git gc in the svn2git source (lib/svn2git/migration.rb:39), as we wanted to do some more manual cleanup before starting a long-running git gc call.

Further, svn2git checks out each branch to set up tracking information. Due to different EOL handling, the mere checkout of a branch sometimes led to local modifications in the files, which svn2git cannot handle. We thus added a line

run_command("git commit -a -m 'svn2git: fix eol'", false)

in the svn2git source (migration.rb:354 and migration.rb:380, after the commands to checkout a branch).

svn2git brought back branches and tags that had already been deleted in Subversion. So we deleted them again with calls to git branch -D and git tag -d.

Next, to obtain a cleaned-up Git repository that is as small as possible, we removed the svn remote and remote branches and ran git garbage collection. This takes a while and requires a lot of memory:

git branch -r -d `git branch -r`
git config --local --remove-section svn
git config --local --remove-section svn-remote.svn
git config --local core.logAllRefUpdates  false
rm -r .git/logs/ .git/svn/
git reflog expire --expire=now --all
git gc --prune=now

Cleanup commit messages and EOL

Next, we ran git filter-branch to modify commit messages and clean up the EOL of text files:

git filter-branch \
    --msg-filter "sed -e 's/git-svn-id: .*\/\([a-zA-Z0-9\.]*\)@\([0-9]*\) .*/svn-id: \2 @\1/' \
      -e 's/\[ *t: *\([0-9][0-9]*\) *\]/ #\1/g' \
      -e 's/\[ *t: *\([0-9][0-9]*\)[ ,]*v:references* *\]/ #\1/g' \
      -e 's/\[ *t: *\([0-9][0-9]*\)[ ,]*v:close *]/ fixes #\1/g' \
      -e 's/\[ *t: *\([0-9][0-9]*\)[ ,]*v:fixed *]/ fixes #\1/g' \
      -e 's/\[ *v:close[ ,]*t: *\([0-9][0-9]*\) *]/ fixes #\1/g'" \
    --index-filter "cp $GITATTR .gitattributes ; git add .gitattributes" \
    --tag-name-filter cat \
    -d /mnt/ramdisk/gittmp \
    -- --all

The first sed command in the message filter replaces the long git-svn-id strings by more concise ones, which only mention the svn revision and svn branch. The other sed commands make references to tickets more readable. That is, strings like [t:12424,v:close] are replaced by fixes #12424. Note that GitLab will recognize a word of the form #[0-9]+ as a ticket reference and make it clickable in the commit view.

The index filter installs a file .gitattributes at the root of every checkout (thus it is committed with the first commit and then never deleted). The .gitattributes file defines that text files should be checked out using the machine's own EOL style (LF on Linux, CRLF on Windows, etc.), but are stored with LF endings in the repository, see also this blog post and this help page. Git includes a heuristic to decide which files are text files. For certain filename endings, we override this heuristic (to be safe) and may also require a fixed EOL style:
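Before starting a filter-branch run that takes hours, the substitutions can be tried on a few sample messages. The sketch below wraps the sed expressions of the message filter in a function; the name rewrite_msg is hypothetical, and the closing bracket and the dot inside the bracket expression are written without a backslash, which should be equivalent with GNU sed.

```shell
# Apply the commit-message substitutions to text read from stdin.
# Try it with, for example:
#   echo 'update libs [t:12424,v:close]' | rewrite_msg
rewrite_msg() {
    sed -e 's/git-svn-id: .*\/\([a-zA-Z0-9.]*\)@\([0-9]*\) .*/svn-id: \2 @\1/' \
        -e 's/\[ *t: *\([0-9][0-9]*\) *]/ #\1/g' \
        -e 's/\[ *t: *\([0-9][0-9]*\)[ ,]*v:references* *]/ #\1/g' \
        -e 's/\[ *t: *\([0-9][0-9]*\)[ ,]*v:close *]/ fixes #\1/g' \
        -e 's/\[ *t: *\([0-9][0-9]*\)[ ,]*v:fixed *]/ fixes #\1/g' \
        -e 's/\[ *v:close[ ,]*t: *\([0-9][0-9]*\) *]/ fixes #\1/g'
}
```

Note that each replacement starts with a blank, so a reference that was already preceded by a blank ends up with two blanks in front of it.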

# Set the default behavior, in case people don't have core.autocrlf set.
# This should usually take care of everything, but to be sure, we setup
# extra rules below.
* text=auto

# Explicitly declare text files you want to always be normalized and converted
# to native line endings on checkout.
*.c text
*.h text

# Declare files that will always have CRLF line endings on checkout.
*.sln text eol=crlf

# Declare files that will always have LF line endings on checkout.
*.sed text eol=lf

# Denote all files that are truly binary and should not be modified.
*.png binary
*.jpg binary
*.gif binary
*.eps binary
*.ps binary
*.pdf binary
*.a binary
*.lib binary
*.so binary
*.dll binary
*.exe binary

Note that this git filter-branch run may change (or "repair") the line endings of already existing files in the history.

The tag name filter just ensures that existing tags will point to the modified commits.

Since git filter-branch checks out each commit in a working directory, a fast working directory is preferred. We created a ramdisk via mount -t tmpfs tmpfs /mnt/ramdisk and told git to use it via the -d flag.

git filter-branch took 11 hours to run, partially because I messed up my mountpoints, so what was intended to be a ramdisk was actually a network drive.
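A quick check of the filesystem type before the long run can catch such a mistake. This is a minimal sketch that assumes GNU coreutils stat; the function names fs_type and require_tmpfs are hypothetical.

```shell
# Print the type of the filesystem that backs a directory (GNU stat).
fs_type() {
    stat --file-system --format=%T "$1"
}

# Succeed only if the directory really lives on tmpfs.
# Possible usage:
#   require_tmpfs /mnt/ramdisk && git filter-branch ... -d /mnt/ramdisk/gittmp -- --all
require_tmpfs() {
    fstype=$(fs_type "$1") || return 1
    if [ "$fstype" != tmpfs ]; then
        echo "warning: $1 is on '$fstype', not tmpfs" >&2
        return 1
    fi
}
```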

To clean up, we deleted the backup that filter-branch had created and ran git gc again:

# remove backup of filter-branch (see its man page)
git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
git reflog expire --expire=now --all
git gc --prune=now

This git repository had a size of about 15 GB and contained 55773 commits.

Migrating to Git-LFS

We tried several tools to move binary files from Git commits into the separate LFS storage. The built-in git lfs migrate seemed to work, but when trying to push the repository later to GitLab or when running git fsck, we got a lot of errors of the form

remote: error: object: duplicateEntries: contains duplicate file entries

We also tried git-lfs-migrate, but that didn't seem to work well (and I have since forgotten what exactly the problem was). Eventually, we used BFG. A problem with BFG is that it creates .gitattributes files that are scattered around the repository and do not follow the format that git expects.

First, to figure out which files take a lot of space in the git repository, we ran git rev-list as suggested here:

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| awk '/^blob/ {print substr($0,6)}' \
| sort --numeric-sort --key=2 \
| cut --complement --characters=13-40 \
| numfmt --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

This helps to make up a list of filename patterns to migrate to LFS.
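To go from individual large files to candidate patterns, the blob sizes can also be summed up per filename extension. The following is a sketch under assumptions: the function name sum_by_extension is hypothetical, and it expects input lines of the form "type size path", as produced by the batch-check format shown in the comment.

```shell
# Sum blob sizes per filename extension and print the largest first.
# Possible usage:
#   git rev-list --objects --all \
#     | git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
#     | sum_by_extension | head -n 20
sum_by_extension() {
    awk '$1 == "blob" {
             name = $0
             sub(/^[^ ]+ [^ ]+ /, "", name)   # drop type and size, keep path
             sub(/^.*\//, "", name)           # keep only the file name
             ext = (name ~ /\./) ? name : "(none)"
             sub(/^.*\./, "*.", ext)          # reduce to the last extension
             bytes[ext] += $2
         }
         END { for (e in bytes) print bytes[e], e }' \
        | sort -k1,1nr -k2,2
}
```

Only the last extension is considered, so a name like libfoo.so.1 is counted under *.1 and still needs a manual look.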

Eventually, we ran BFG as

bfg --no-blob-protection --private \
  --convert-to-git-lfs '{*.so,*.so.*,*.lib,*.LIB,*.dll,*.a,*.dylib,*.sl,*.jnilib,*.exe,*.zip,*.hmx,*.chm,*.pdf,_*.bmp,clip*.bmp,eq*.bmp,*.dmg,*.dmg.gz,*.dmg.bz2,*.AppImage,*.TTF}'

(I skipped some very GAMS-specific patterns.) Originally, we were using *.bmp instead of _*.bmp,clip*.bmp,eq*.bmp, but BFG skips the migration of files to LFS if their size is below 512 bytes. If the .gitattributes file still specified that all *.bmp files are stored in LFS, one would obtain warning messages like Encountered 1 file(s) that should have been pointers, but weren't when using the repository. Thus, we adjusted the filename patterns to avoid matching files below 512 bytes.

The BFG run took only 15 minutes.
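Files that a pattern would match but that fall below the 512 byte limit can be listed before running BFG. This is a minimal sketch; the function name small_blobs is hypothetical, and it reads "type size path" lines as produced by git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)'.

```shell
# List blobs whose path ends in the given suffix and whose size is below
# the limit (512 bytes by default).
# Possible usage:
#   git rev-list --objects --all \
#     | git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
#     | small_blobs .bmp
small_blobs() {
    suffix=$1
    limit=${2:-512}
    awk -v suffix="$suffix" -v limit="$limit" '
        $1 == "blob" && $2 + 0 < limit + 0 {
            path = $0
            sub(/^[^ ]+ [^ ]+ /, "", path)   # drop type and size, keep path
            n = length(path); m = length(suffix)
            if (n >= m && substr(path, n - m + 1) == suffix) print $2, path
        }'
}
```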

Next, to install a correct .gitattributes file, we first deleted .gitattributes files from all directories in the git history and installed our own one in the root directory:

bfg --no-blob-protection --private --delete-files ".gitattributes"
git filter-branch \
    --index-filter "cp $GITATTR .gitattributes ; git add .gitattributes" \
    --tag-name-filter cat \
    -d /mnt/ramdisk/gittmp \
    -- --all

The .gitattributes file that we installed was the same as before (regarding the EOL handling) with additional statements for LFS:

# Denote files that should be stored in Git-LFS
*.so filter=lfs diff=lfs merge=lfs -text
*.so.* filter=lfs diff=lfs merge=lfs -text
*.lib filter=lfs diff=lfs merge=lfs -text
*.LIB filter=lfs diff=lfs merge=lfs -text
*.dll filter=lfs diff=lfs merge=lfs -text
*.a filter=lfs diff=lfs merge=lfs -text
*.dylib filter=lfs diff=lfs merge=lfs -text
*.sl filter=lfs diff=lfs merge=lfs -text
*.jnilib filter=lfs diff=lfs merge=lfs -text
*.exe filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.hmx filter=lfs diff=lfs merge=lfs -text
*.chm filter=lfs diff=lfs merge=lfs -text
*.pdf filter=lfs diff=lfs merge=lfs -text
_*.bmp filter=lfs diff=lfs merge=lfs -text
clip*.bmp filter=lfs diff=lfs merge=lfs -text
eq*.bmp filter=lfs diff=lfs merge=lfs -text
*.dmg filter=lfs diff=lfs merge=lfs -text
*.dmg.gz filter=lfs diff=lfs merge=lfs -text
*.dmg.bz2 filter=lfs diff=lfs merge=lfs -text
*.AppImage filter=lfs diff=lfs merge=lfs -text
*.TTF filter=lfs diff=lfs merge=lfs -text

We then did another round of cleanup and ran git lfs install to be safe:

git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
git lfs install
git reflog expire --expire=now --expire-unreachable=now --all
git gc --prune=now  --aggressive

This again took a few hours. The last gc call also reduced the number of commits from 55773 to 45841.

Since Git-LFS does not compress the files that it stores, the Git repository has grown from originally 15 GB (most of it in .git/objects) to 100GB (2.1GB for .git/objects, 98GB for .git/lfs):

$ du -shc *
4.0K  branches
4.0K  COMMIT_EDITMSG
4.0K  config
4.0K  description
4.0K  HEAD
68K   hooks
3.5M  index
24K   info
98G   lfs
2.1G  objects
12K   packed-refs
164K  refs
100G  total

Finally, git lfs migrate info shows which filename patterns take up the most space in the converted Git repository:

*.pas 1.8 GB     37591/37591 files(s)  100%
*.c   927 MB     15493/15493 files(s)  100%
*.f90 788 MB     12242/12242 files(s)  100%
*.h   403 MB   259269/259271 files(s)  100%
*.inc 397 MB   541170/541172 files(s)  100%

The repository is now ready to be pushed to a GitLab server. (This again took a few hours.)

A git clone now initially downloads 3.7GB:

$ du -shc *
4.0K  branches
4.0K  config
4.0K  description
4.0K  HEAD
68K   hooks
3.5M  index
8.0K  info
3.3G  lfs
32K   logs
448M  objects
12K   packed-refs
28K   refs
3.7G  total

Migration of bugtracking

We looked at trac-to-gitlab and TracBoat for some automated way to convert Trac tickets to GitLab issues. The latter is said to have been inspired by the former, but was then heavily cleaned up and refactored. Both are Python scripts that access a running Trac instance via the XML-RPC plugin. Both can write directly into the GitLab database, but trac-to-gitlab can alternatively use the GitLab REST API. Since writing to the GitLab database is not officially supported by GitLab and might require adjustments to the scripts to work with the current database schema (TracBoat supports up to GitLab 9.5, trac-to-gitlab up to 9.0), we decided to use the official GitLab API. trac-to-gitlab assumed the v3 API, but not much needs to be changed to work with the current v4 API.

While trac-to-gitlab included a lot of the essential functionality that we required, we were also missing several things. For example, trac-to-gitlab did not preserve information on who authored a comment on a ticket and when this happened, nor was the history of attribute changes (components, ticket severity, ticket type) to a ticket preserved; only the current list of attributes was converted into GitLab labels. Attachments were also not handled. Eventually, we made many modifications to the trac-to-gitlab source, thereby removing some unused parts and adjusting it exactly to our needs. The script can be downloaded here (no support, no maintenance):

We assume an "empty" GitLab server, that is, a server on which only the administration account and a project with the previously converted Git repository exist. The trac-to-gitlab conversion then converts each ticket from Trac to a GitLab issue, thereby ensuring a 1:1 mapping between ticket and issue numbers. Also, we create users when necessary and upload attachments. To be able to preserve original authorship and creation and update dates, we access GitLab as admin user and also create new users as admins. The former allows us to impersonate all users, while the admin rights allow us to set the creation and update date of issues and notes. Handling the update dates correctly requires a patch in GitLab, which should be part of GitLab 11.2. Unfortunately, the date of closing or reopening an issue or modifying labels currently cannot be modified, so we eventually resorted to a workaround (adding comments about these changes with the correct date and only setting the final state and final list of labels).
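For illustration, creating one issue with a fixed number, author, and creation date through the v4 API could look roughly as follows. This is a sketch and not an excerpt from our script: the function name create_issue, the server URL, and the CURL override are hypothetical, and it assumes that impersonation is done via the API's Sudo header with an administrator token in GITLAB_TOKEN.

```shell
# Hypothetical server URL; override it in the environment.
GITLAB_URL=${GITLAB_URL:-https://gitlab.example.com}

# create_issue <project-id> <iid> <author-username> <created-at> <title>
# Setting CURL=echo prints the request arguments instead of sending them.
create_issue() {
    project_id=$1 iid=$2 author=$3 created_at=$4 title=$5
    ${CURL:-curl} --silent --fail --request POST \
        --header "PRIVATE-TOKEN: ${GITLAB_TOKEN:?set an admin token}" \
        --header "Sudo: $author" \
        --data-urlencode "iid=$iid" \
        --data-urlencode "title=$title" \
        --data-urlencode "created_at=$created_at" \
        "$GITLAB_URL/api/v4/projects/$project_id/issues"
}
```

Passing the Trac ticket number as iid is what keeps ticket and issue numbers identical, and created_at is only honored for sufficiently privileged users.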

Many Trac tickets referred to an svn commit by its revision number, either in comments that were added to a ticket directly in Trac, or due to a comment that was created by mentioning a ticket in an svn commit message. Using the svn-id information in the Git commit messages, we can create a mapping from svn revisions to git commit hashes:

# for each commit, print the hash (%H), the summary (%s), and the body (%b), followed by a zero-byte
# remove all new-lines and replace zero-bytes by newlines -> every commit on one line
# let sed print hash, svn-id (incl branch, if there) for each line
git log --format="format:%H %s %b%x00" --all \
  | tr -d "\n" | tr "\000" "\n" \
  | sed -n \
    -e "/git-svn-id/d" \
    -e "/svn-id/s/\([0-9a-f]*\) .*svn-id: \([a-zA-Z0-9@]*\)/\1 \2/p"

We then used this mapping to replace most references to svn revisions by references to git hashes when creating the GitLab issues.

Eventually, we were able to construct a pretty close representation of the Trac tickets in GitLab. The following information was lost: the changes in the cc of a ticket (we only kept the final list of subscribers); subscribers for which we did not create a GitLab account (in Trac you can subscribe any e-mail address, in GitLab only registered users); the history of changes that were made to the ticket description and to the comments in a ticket (GitLab does not store the history of changes to a note or description, as far as I know); milestones (they could be converted automatically, but are rarely used at GAMS).

We started Trac with the XML-RPC plugin via
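The replacement step itself can be sketched in a few lines of awk. The assumptions here are loud ones: the function name replace_svn_refs is hypothetical, the mapping file holds one "hash revision" pair per line (a trailing @branch on the revision is ignored), and references are taken to have the form r<number>, which is only one of the ways a ticket can mention a revision.

```shell
# replace_svn_refs <mapfile>: rewrite stand-alone r<number> tokens on stdin
# to the first 8 characters of the mapped git hash. Unknown revisions and
# matches inside longer words are left untouched.
replace_svn_refs() {
    awk -v mapfile="$1" '
        BEGIN {
            while ((getline line < mapfile) > 0) {
                split(line, f, " ")
                rev = f[2]; sub(/@.*/, "", rev)   # drop a trailing @branch, if any
                hash[rev] = f[1]
            }
        }
        {
            out = ""; rest = $0
            while (match(rest, /r[0-9]+/)) {
                before = out substr(rest, 1, RSTART - 1)
                token  = substr(rest, RSTART, RLENGTH)
                rev    = substr(token, 2)
                if (before !~ /[A-Za-z0-9]$/ && (rev in hash))
                    token = substr(hash[rev], 1, 8)
                out  = before token
                rest = substr(rest, RSTART + RLENGTH)
            }
            print out rest
        }'
}
```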

trac-admin source.trac upgrade
trac-admin source.trac wiki upgrade
trac-admin source.trac config set components "tracrpc.*" enabled
trac-admin source.trac config set repositories .dir $PWD/source.svn
trac-admin source.trac permission add anonymous WIKI_VIEW TICKET_VIEW REPORT_VIEW FILE_VIEW LOG_VIEW MILESTONE_VIEW XML_RPC
tracd -b localhost -p 8080 -s -d source.trac/

and then ran our Trac to GitLab migration script. The script took about 1-2 hours to run.
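Before starting the migration, it can be worth checking by hand that the XML-RPC interface answers. The sketch below makes several assumptions: that the plugin serves anonymous requests under /rpc of the tracd instance started above, that it offers a ticket.get method taking the ticket number, and that curl is available; the function names are hypothetical.

```shell
# Assumed endpoint of the XML-RPC plugin; adjust if it differs.
TRAC_RPC_URL=${TRAC_RPC_URL:-http://localhost:8080/rpc}

# Build an XML-RPC request body for a method with one integer argument.
xmlrpc_int_call() {
    printf '<?xml version="1.0"?><methodCall><methodName>%s</methodName><params><param><value><int>%d</int></value></param></params></methodCall>' "$1" "$2"
}

# Fetch the raw XML-RPC response for one ticket, e.g.: fetch_ticket 1
fetch_ticket() {
    xmlrpc_int_call ticket.get "$1" \
        | curl --silent --fail --header 'Content-Type: text/xml' \
               --data-binary @- "$TRAC_RPC_URL"
}
```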