Centralized vs. Decentralized Version Control

A few tools are used in any software project. The most obvious are an interpreter or compiler and a text editor. Version control systems are also part of any reasonably managed project, and recently distributed version control systems such as git have become available. These have made some activities easier, such as:

  • Offline development while still tracking change history. More importantly, that change history can be merged into the main tree when next online.
  • Branches hosted on different repositories/hosts, while still allowing robust integration of changes.
  • Syncing submissions over unusual transport mechanisms. For example, I've submitted a change to a darcs repository over 'carrier pigeon': an SD card physically carried between the two machines.

While traditional centralized version control systems can support the uses above, doing so often involves working against the system rather than with it. However, distributed version control systems have some drawbacks as well.

For someone running a backup of their system, a lot of data is now duplicated between multiple systems. This adds either complexity or space requirements to the backup, depending on how it is approached. For a commercial project, the loss of a change, even if temporary, represents lost work hours. If the system encourages a series of local commits before pushing to the main system, a single loss can represent weeks' worth of work. However, backing up every duplicated repository in full has its own cost: a large project with a large number of developers quickly means very large space requirements.

Change identifiers are also no longer simple. In a centralized system an identifier is normally an incremented integer. Distributed systems can't do this, because change identifiers must not collide between separate hosts. An example machine-generated identifier from git is c82a22c39cbc32576f64f5c6b3f24b99ea8149c7. To get around this, distributed version control systems may refer to user-specified tags instead. However, this pushes the assignment of simple identifiers onto the user, and no longer guarantees uniqueness between repositories.
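To make the long identifier less mysterious: with git's SHA-1 object format, an object id is the SHA-1 of a short header followed by the object's bytes, so two hosts that store the same content independently arrive at the same id without coordinating. The sketch below covers only the simplest case, a blob (a file's contents); the example id above is presumably a commit, which is hashed the same way with a different header. The function name git_blob_id is my own, not part of git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the id git's SHA-1 object format gives a blob with this content."""
    # Header is the object type, a space, the byte length, then a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Any host hashing the same bytes gets the same 40-character id.
print(git_blob_id(b""))
print(git_blob_id(b"a change made on host A\n"))
```

Because the id depends only on content, it cannot be a small counter: no host knows how many changes the others have made.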

Finally, the additional repositories can raise issues of merge responsibility. With a central repository, the person making a change generally has to perform any merge at the time of submission. With distributed repositories, Abe can integrate his change into Bob's repository, and Bob may then be burdened with the merge when he attempts to sync with Carol's repository. The freedom to accept changes from somewhere other than a central repository brings the burden of merging those changes when syncing with yet another repository.

However, these problems are not insurmountable.

The backup problem can be solved by leveraging the features of a distributed version control system. A list of patch identifiers can quickly show which patches are not yet backed up. This also allows the rest of the backup to be simply the files that do not match the repository.
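As a minimal sketch of that idea, assuming each repository can list its patch identifiers in order and the backup location keeps a record of the identifiers it already holds, the work reduces to a set difference. The function name and the short identifiers are hypothetical, purely for illustration.

```python
def patches_to_back_up(repo_patch_ids, backed_up_ids):
    """Return the patch ids the backup does not hold yet, in repository order."""
    already = set(backed_up_ids)
    return [patch for patch in repo_patch_ids if patch not in already]


# Two developers' repositories share most of their history.
abe = ["p1", "p2", "p3", "p4"]
bob = ["p1", "p2", "p5"]

backup = set()
for repo in (abe, bob):
    new = patches_to_back_up(repo, backup)
    backup.update(new)  # only the unseen patches are stored
    print(new)
```

Shared history is stored once, so the backup grows with the number of distinct patches rather than with the number of repositories.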

The issue of unique yet simple change numbers can also be dealt with easily. By marking a host as authoritative for a branch, you can refer to change numbers in relation to that host. This host can then use the integer increment method of centralized version control systems to create the simple identifier. As long as a group agrees on which authoritative host they mean when referring to a patch, they can use 1121347 instead of c82a22c39cbc32576f64f5c6b3f24b99ea8149c7.
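One way to picture this is a small table kept by the authoritative host that hands out the next integer for each change it accepts. This is a sketch of the concept only, not something git provides; the class and method names are hypothetical.

```python
class AuthoritativeNumbering:
    """Maps long change ids to sequential integers on one authoritative host."""

    def __init__(self):
        self._by_number = []  # index i holds the change id for number i + 1
        self._by_id = {}      # change id -> assigned number

    def register(self, change_id):
        """Assign the next integer to a change, or return the one it already has."""
        if change_id in self._by_id:
            return self._by_id[change_id]
        self._by_number.append(change_id)
        number = len(self._by_number)
        self._by_id[change_id] = number
        return number

    def resolve(self, number):
        """Translate a simple number back to the full change id."""
        if not 1 <= number <= len(self._by_number):
            raise KeyError(number)
        return self._by_number[number - 1]


numbering = AuthoritativeNumbering()
n = numbering.register("c82a22c39cbc32576f64f5c6b3f24b99ea8149c7")
print(n, numbering.resolve(n))
```

The numbers are only meaningful relative to the host that issued them, which is why the group has to agree on which host that is.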

The concept of an authoritative host can also aid in the merge case. Let's say one has two major projects in two geographical locations. Obviously, for productivity you don't want to have to connect to the other side of the world every time you submit a change. However, you may still want to get updates from the other project, as well as occasionally integrate a change from one project into the other. The reverse of an authoritative host for a branch would be a proxy host for a branch: a host that, for a particular branch, only accepts changes from a specified authoritative host. Merging becomes a non-issue between these two hosts for that branch, as the merge is done on submission to the authoritative host.
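The rule can be modelled in a few lines. In this sketch a repository records, per branch, which host is authoritative; if that host is someone else, the repository acts as a proxy and refuses changes from any other source. The class, its fields, and the host names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Repository:
    name: str
    # branch -> name of the authoritative host; a missing entry means "this host"
    authority_for: dict = field(default_factory=dict)
    history: dict = field(default_factory=dict)  # branch -> accepted change ids

    def accept(self, branch, change_id, source):
        """Record a change unless this host is a proxy and the source is not the authority."""
        authority = self.authority_for.get(branch, self.name)
        if authority != self.name and source != authority:
            return False  # proxy: only the authoritative host may update this branch
        self.history.setdefault(branch, []).append(change_id)
        return True


london = Repository("london")                      # authoritative for its own branches
sydney = Repository("sydney", {"main": "london"})  # proxy for london's "main"

print(london.accept("main", "c1", source="abe"))     # True: merged at the authority
print(sydney.accept("main", "c1", source="london"))  # True: update from the authority
print(sydney.accept("main", "c2", source="bob"))     # False: proxy refuses direct changes
```

Since the proxy's copy of the branch only ever moves forward along the authority's history, syncing the two never requires a merge.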

This leads to the short list of features I'd like to have when working with a distributed version control system.

  • The notion of proxy and authoritative repositories for a branch. A repository can be marked as authoritative for a branch, and other repositories can then be marked as proxies of the authoritative repository for that branch.
  • The ability to get simplified patch identifiers from an authoritative branch.
  • A form of backup sync, so that a minimal amount of data needs to be stored to back up a number of repositories to a single location.

While I'm not sure how to achieve simplified patch identifiers without knowing more about git's internals than I do, the other two features should be somewhat easier. This may mean my next post will be about how I managed to implement the above suggestions around git. If one can disable pushes for a branch and automate or trigger pulls, that is enough for proxy branches. The backup should be achievable with a script that runs diff, changes, and describe, along with a staging area for the repositories being backed up. From my point of view this is good news, since the remaining issue is somewhat less useful.
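For the "disable pushes for a branch" half, one possible route is git's server-side pre-receive hook: git feeds it one line per pushed ref on standard input in the form "old-id new-id ref-name", and a non-zero exit status rejects the push. The sketch below is an assumption about how such a hook might look, not a tested setup; the protected branch name and the function names are made up.

```python
import sys

# Assumption: the branch this repository only proxies.
PROTECTED_REFS = {"refs/heads/main"}


def rejected_refs(lines, protected=PROTECTED_REFS):
    """Return the pushed ref names that this proxy must refuse."""
    rejected = []
    for line in lines:
        parts = line.split()  # expected: old id, new id, ref name
        if len(parts) == 3 and parts[2] in protected:
            rejected.append(parts[2])
    return rejected


def hook_exit_code(lines):
    """Report refused refs on stderr and return the status the hook should exit with."""
    bad = rejected_refs(lines)
    for ref in bad:
        print(f"push to {ref} refused: this repository is a proxy for it",
              file=sys.stderr)
    return 1 if bad else 0

# In an actual hook file, the last line would be:
#   sys.exit(hook_exit_code(sys.stdin))
```

The hook only sees pushes arriving at the repository, so pulls that the proxy itself makes from the authoritative host are unaffected; those could be scheduled or triggered separately.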
