Development history

This is a story about version control and preserving some of my personal source code history. It involves two programs I wrote, "sv" and "histarc".

Back story

In 2007 I wrote a basic version control thing in Python.

[sv.py]
#!/usr/bin/python

# Snapsnot Versions - a sort of cheap version history system.
#
# Create a "manifest" file in a directory containing a list of
# files that comprise the module.  The module name is the basename
# of the directory.
#
# Snapshot indexes are stored in,
#   ~/.sv/modules/<module>/<timestamp>.sv
#
# Contents are stored under,
#   ~/.sv/hashdir/<hash-string>

Ha, I just noticed the typo, "snapsnot" should be "snapshot".

Anyway, why would I write something like this, when there were several existing version control systems? It's a little hard to recall after 10 years. I had used CVS since 1997 or so, and apparently I was unhappy enough with it that I wanted to try something new. Subversion existed, and I had tried it with one project, but I hated the convention of having trunk, tags, and branches exposed at the top level. Whether it was required or just a convention, ugh, I don't want to see those folders all the time. Hide that shit in a dot-folder. Git had also existed since 2005, and I read about it and really liked the idea of content-addressable storage by hash checksum. Maybe I wasn't ready to take the dive into Git, or I found a depth-first read of the Git documentation too overwhelming, so I put off learning it. So I wrote "sv" for several reasons:

sv was just a few hundred lines of Python code in a single module. It had a simple set of commands,

with suboptions on some of them. It did not allow for commit messages, but I could include a ChangeLog file in any project, which has the benefit of preserving history on raw export. sv had one feature that was kind of unique: data deduplication even between modules, by virtue of having a common "hashdir" where all the files were stored. Not good for scaling, but fine for my purposes.
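
To make the shared-hashdir idea concrete, here is a minimal sketch of content-addressed storage with deduplication across modules. It is not the actual sv code: the function names, the use of MD5, and the in-memory index are all assumptions for illustration.

```python
import hashlib
import os
import shutil


def store_file(path, hashdir):
    """Copy path into hashdir under its content hash and return the hash."""
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()  # MD5 is an assumption
    dest = os.path.join(hashdir, digest)
    if not os.path.exists(dest):  # identical content is stored only once
        shutil.copyfile(path, dest)
    return digest


def snapshot(module_dir, hashdir):
    """Return {filename: hash} for every file listed in module_dir/manifest."""
    with open(os.path.join(module_dir, "manifest")) as f:
        names = [line.strip() for line in f if line.strip()]
    return {name: store_file(os.path.join(module_dir, name), hashdir)
            for name in names}
```

Because the stored name depends only on the content, two modules that contain the same file share one entry in the hashdir, which is where the cross-module deduplication comes from.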

Here is an example of basic usage of sv,

$ ls -1 > manifest
$ sv snapshot .
$ sv history

I eventually had 67 sv modules on my main development system,

21lex                  histarc                 pictbridgeprinter    tsmerge
aria                   idsystem                pidlink              tssplit
avedit                 inventory               platforms            tstodvd
avplug                 jamie1                  printapp             tvcap
avswitcher             lcmeter                 progrunner           uback
backups                libdv                   ptprof               updater
bin                    linux-2.6.34-buici-jsg  python               upload_linux
bluebutton             lpd                     rcm                  usb-debug
buildroot-2010.02-lpd  modc                    rfemon               vfilter
bwi-samplecode         modc.exp1               scanner-locator      webapp
convert                movie-scripts           scripts              webapps
cti                    nac                     sdlgl                wicn
dvd-guinan-family      ncjpeg                  sftpserver           wtut_ver
EH-system              ncjpeg-take2            sully_www_localhost  x.mail
**********             nslu2s                  swdev                ZBECAL
fotofwd                panhandler              tmp14                zfetch
hackaday               pbsmtp                  tsgraph

and a total of 5400 files in my hashdir folder, comprising all unique versions of every file in all modules.

I'll also mention that over the years I had accumulated 80 CVS modules (folders), with some overlap with the sv modules above,

accounting       dcam         jpkg                  pagegen       soundedit
account_manager  drpm         jumprocket            panels        tetrisgame
ACCTMGR          ds1xfer      lantv                 parser        timeline
agents           dv           *********_Feb_1999_a  pclock        timetracker
backdoor         eclone       matrix                pcsupportinc  tooldisk
balls            electronics  ***_March_1999_a      povstuff      usb-debug
bbadmin          ezmenu       minitools             pse           usine
bbadmin2         fatpipe      miptexviewer          python-stuff  uwm
bbbackup         **********   mkc2                  quake2src     vbitv
bbmss            fui          modelling             quickgui      wake
bbtv             gasket       multienv              resume        webapps
bluebutton       gclassgen    netprint              rocketry      wicn
cdrip            homedir      noisyfw               scan_buy      www.bluebutton.com
clients          hrt          NuppelVideo           scripts       wxtris
ctod             jame         nuvx                  serialstuff   x.mail
dbcop            jkt          oggcutter             *****_www     xvidmode

I won't get further into CVS here, but I mention it because I might do something similar to what follows with my old CVS trees in the future.

Learning and using Git

I finally started using Git in 2014. I also started logging my command-line usage that year using a program I wrote called histarc, so with a little command-line magic I can generate a histogram of my Git usage since 2014,

$ histarc query 'git ' | grep '^git' | awk '{print $1, $2}' | sort | uniq -c | sort -rn | head -n 20
6547 git diff
2842 git commit
2119 git push
2117 git pull
956 git log
818 git add
579 git remote
477 git branch
451 git status
417 git checkout
288 git clone
105 git merge
98 git rm
98 git dif (common typo)
83 git mv
71 git init
70 git tag
29 git stash
27 git blame
25 git config

I find that pretty interesting.
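
The same tally can be done in Python. This is a rough sketch, assuming the logged commands are available as plain text lines; the function name `subcommand_histogram` is made up for illustration and is not part of histarc.

```python
from collections import Counter


def subcommand_histogram(lines, prefix="git", top=20):
    """Count the first two words of every line that starts with prefix."""
    counts = Counter()
    for line in lines:
        if not line.startswith(prefix):  # same filter as grep '^git'
            continue
        fields = line.split()
        counts[" ".join(fields[:2])] += 1  # like awk '{print $1, $2}'
    return counts.most_common(top)  # like sort | uniq -c | sort -rn | head
```

`Counter.most_common` does the work of the `sort | uniq -c | sort -rn | head` stages in one call.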

Git has a large number of subcommands, each with its own options. Here is a count of the links from the main git man page to other man pages like "git-add(1)",

$ man git | grep '^.......git-.*(1)' | wc -l
138

That is a very broad and deep tree to browse, and that is why I will almost certainly never call myself a git "expert". My usage histogram above could serve as a basis for an "introduction to git", but there are enough of those already.

Merging some old history

So here I am today, with the impetus for this tech note: I started tracking several projects in Git over the past few years, some of which I had previously tracked in CVS or sv, but I did not bother to import the old commits; I just started a Git tree from the latest version. Now I would like to have the old history preserved and integrated into my Git trees. It is part of my personal history as a developer, and thus important to me.

I've been working on my CTI project since 2010, and it has some "lost history" in sv, so it is exactly the kind of thing I'm looking to recover.

Each commit in sv is named by the Epoch timestamp at the time of commit, which sv history decodes for readability,

$ sv history cti | tac
1297972635.sv Thu Feb 17 14:57:15 2011 - first sv commit
1300904413.sv Wed Mar 23 14:20:13 2011
1305392704.sv Sat May 14 13:05:04 2011
1305560596.sv Mon May 16 11:43:16 2011
1305602816.sv Mon May 16 23:26:56 2011
...
1400497382.sv Mon May 19 07:03:02 2014
1401311877.sv Wed May 28 17:17:57 2014
1401396295.sv Thu May 29 16:44:55 2014
1401660042.sv Sun Jun  1 18:00:42 2014
1401842819.sv Tue Jun  3 20:46:59 2014 - first Git commit around this date
1401906205.sv Wed Jun  4 14:23:25 2014
1401906708.sv Wed Jun  4 14:31:48 2014
...
1403624566.sv Tue Jun 24 11:42:46 2014
1404337656.sv Wed Jul  2 17:47:36 2014
1407782371.sv Mon Aug 11 14:39:31 2014
1416317815.sv Tue Nov 18 08:36:55 2014 - last sv commit ever
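
Decoding a snapshot name takes only a few lines of standard-library Python. The sketch below is an illustration, not the sv implementation; it uses a fixed UTC offset so the result does not depend on the machine's time zone, whereas the listing above appears to use local time (the summer entries are an hour ahead of what a fixed -0500 would give).

```python
from datetime import datetime, timedelta, timezone


def decode_snapshot_name(name, utc_offset_hours=-5):
    """Turn '<epoch>.sv' into a readable date at a fixed UTC offset."""
    epoch = int(name[:-len(".sv")])
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(epoch, tz).strftime("%a %b %d %H:%M:%S %Y")


print(decode_snapshot_name("1297972635.sv"))  # Thu Feb 17 14:57:15 2011
```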

I tracked CTI in both sv and Git for several months, while I was gaining some experience and confidence in Git, so you can see the overlap in dates,

$ git log --reverse | grep -E 'Date:|^commit'
commit 127a1d99d5ae1f090c49d4dce00cd1aa02a1d014  - first Git commit
Date:   Tue Jun 3 20:32:49 2014 -0400
commit 3b4762a1d71d3b0dd407aa9afe010a19828e33a9
Date:   Tue Jun 3 20:47:10 2014 -0400
commit cf9e06dd7b2cc17a645a58181d5ef64b2562f003
Date:   Wed Jun 4 14:23:06 2014 -0400
commit 535fe3a0451e056c92e587cd005b8390b4a8aea6
Date:   Wed Jun 4 14:31:20 2014 -0400
...
commit 3c5b393876dd488cc5a6b0e44f7e0b6220c5a9b7
Date:   Tue Jun 24 11:44:37 2014 -0400
commit 981f1c7d9e8f062a41710e3a56d5e240024a88f7
Date:   Sat Jun 28 22:23:42 2014 -0400
commit 485e03a3828093c329dd817941b83b800dc68693
Date:   Mon Aug 11 14:24:19 2014 -0400
commit 5d071de6c34aa76ce70f79341d4542664b38185b
Date:   Tue Oct 14 17:45:57 2014 -0400
commit 700ee01a659b9e1161fc33d9fc8ee2f22594e384
Date:   Tue Nov 11 09:11:21 2014 -0500          ,- last sv commit somewhere here
commit ca84cb8a6df09f7474437c0450149f88a544685b -- first Git-only commit
Date:   Sat Nov 22 11:01:37 2014 -0500         
commit 6cfc1fcd91aa942efd67caaa325cb6335d94ed75
Date:   Sat Nov 22 11:21:21 2014 -0500
...

Here are the questions/problems that I pose:

I'll start by checking out the current CTI Git tree,

$ git clone git@github.com:jamieguinan/cti.git
Cloning into 'cti'...
remote: Counting objects: 2332, done.
remote: Compressing objects: 100% (97/97), done.
remote: Total 2332 (delta 88), reused 148 (delta 82), pack-reused 2153
Receiving objects: 100% (2332/2332), 990.89 KiB | 0 bytes/s, done.
Resolving deltas: 100% (1346/1346), done.
Checking connectivity... done.

$ cd cti

Make an empty branch,

$ git checkout --orphan sv-history
Switched to a new branch 'sv-history'

$ git status
On branch sv-history
Initial commit
Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
    new file:   AAC.c
    new file:   AAC.h
    new file:   ALSACapRec.c
    new file:   ALSACapRec.h
    new file:   ALSAMixer.c
    ...

Clean up the files,

$ git rm --cached $(find * -type f)
$ rm -v $(find * -type f)

Now I have an empty folder, on a new git branch called sv-history.

Next I'll use sv tgz to get a tarball of the very first sv snapshot of the project, from well before I started using git,

$ cd ..
$ sv tgz cti sv=1297972635.sv
'/home/guinan/.sv/hashdir/1191842484f7c4fedf100f4f09cc312e' -> './svtmp/cti/ALSAio.c'
'/home/guinan/.sv/hashdir/63fc40a598c53cef113c64f260f9b2a2' -> './svtmp/cti/ALSAio.h'
'/home/guinan/.sv/hashdir/d9331774e99c78cdf4ea7879ed1a08b8' -> './svtmp/cti/AVIDemux.c'
'/home/guinan/.sv/hashdir/e1e33e77b4324e7ab90f4515c66b969e' -> './svtmp/cti/AVIDemux.h'
'/home/guinan/.sv/hashdir/68b329da9893e34099c7d8ad5cb9c940' -> './svtmp/cti/AVIMux.c'
...
cti/
cti/JpegTran.h
cti/tv.cmd
cti/avv.cmd
cti/Effects01.c
cti/ov511-server.cmd

$ ls -l cti.tgz
-rw-rw-r-- 1 guinan guinan 149549 Mar 22 10:24 cti.tgz

Unpack it, and add all the files to the git sv-history branch,

$ tar -xvf cti.tgz
$ cd cti
$ git add $(find * -type f)
$ git commit --date '1297972635 -0500' -m 'Initial sv import.'

Knowing the structure of my sv repository, I can use some Python code to generate the sv history diffs as files under /tmp/diffs/, one per pair of consecutive snapshots, each named after the timestamp of the newer snapshot,

>>> import os
>>> os.mkdir('/tmp/diffs')    
>>> os.chdir('/home/guinan/.sv/modules/cti')
>>> versions = sorted(os.listdir('.'))  # the snapshot version files are named "<timestamp>.sv"
>>> for i in range(len(versions)-1):
...     os.system('sv diff cti sv1=%s sv2=%s > /tmp/diffs/%s' % (versions[i], versions[i+1], versions[i+1][:-3]))
...
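
The loop above pairs each snapshot with the one that follows it. As a standalone script, the same idea could be sketched like this, assuming Python 3 and the `sv diff` arguments shown above; the helper names are made up for illustration.

```python
import subprocess


def consecutive_pairs(names):
    """Pair each snapshot with its successor, ordered by numeric timestamp."""
    ordered = sorted(names, key=lambda n: int(n[:-len(".sv")]))
    return list(zip(ordered, ordered[1:]))


def write_diffs(module, names, outdir):
    """Run 'sv diff' per consecutive pair; one output file per newer snapshot."""
    for old, new in consecutive_pairs(names):
        outfile = "%s/%s" % (outdir, new[:-len(".sv")])
        with open(outfile, "w") as out:
            subprocess.run(["sv", "diff", module, "sv1=" + old, "sv2=" + new],
                           stdout=out, check=False)
```

Sorting by the integer value rather than by string only matters if the timestamps ever differ in digit count, which they do not for the dates here, but it costs nothing.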

Test the 2nd commit to make sure it will apply nicely,

$ cat /tmp/diffs/1300904413 | patch -p1 --dry-run
checking file ALSAio.c
checking file Array.h
checking file Audio.c
checking file Audio.h
checking file DO511.c
...

Great. Now for this long bash invocation (admittedly the result of about 30 minutes of trial and error), which applies all the sv commits to Git, pausing for Enter before each one,

 for f in $(ls /tmp/diffs/); \
 do echo -n "next $f: "; read xyz; \
 cat /tmp/diffs/$f | patch -p1 ; \
 git add $(find * -type f); \
 git commit --date "$(date +'%s %z' --date=@$f)" -a -m "sv import patch $f"; \
 done
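
The `--date` value built in that loop is a Unix timestamp followed by a UTC offset, which is one of the date formats Git accepts. For reference, here is a Python sketch of what `date +'%s %z' --date=@$f` produces; the function name `git_date` is made up for illustration.

```python
import time


def git_date(epoch):
    """Format an epoch timestamp as '<epoch> <+/-hhmm>' in the local zone."""
    offset = time.localtime(epoch).tm_gmtoff  # seconds east of UTC at that moment
    sign = "+" if offset >= 0 else "-"
    hours, minutes = divmod(abs(offset) // 60, 60)
    return "%d %s%02d%02d" % (epoch, sign, hours, minutes)
```

The offset is whatever applied locally at that instant, so on a machine set to US Eastern time the summer snapshots would come out as -0400 and the winter ones as -0500.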

At this point, I had intended to try something like this or this to insert old sv commits before the first commit on master, but then I realized, why bother? The important thing is that I have the old sv history in the ongoing Git tree, and it's fine that it's on another branch, appropriately named sv-history. No need to disturb the master branch and upset the clones I have on several different systems.

I captured the whole process (except for the parts requiring Python and sv) in a script, which I could replay any number of times as long as I didn't push back to upstream,

#!/bin/bash
rm -rf cti
git clone git@github.com:jamieguinan/cti.git
cd cti
git checkout --orphan sv-history
git rm --cached $(find * -type f)
history
rm $(find * -type f)
(cd ..; tar -xf cti.tgz)
git add $(find * -type f)
git commit --date '1297972635 -0500' -m 'Initial sv import.'
# See python code above for populating /tmp/diffs
for f in $(ls /tmp/diffs/); do
  echo $f
  cat /tmp/diffs/$f | patch -p1
  git add $(find * -type f)
  git commit --date "$(date +'%s %z' --date=@$f)" -a -m "sv import patch $f"
done

Summary

Several good things came out of this effort.