view en/intro.tex @ 267:d145739b7865

The script actually only creates 35 changesets (as I understand it)
author bab@draketo.de
date Fri, 31 Aug 2007 19:44:35 +0200
parents 9dbed77d3ba6
children 4700dd38384c
line wrap: on
line source

\chapter{Introduction}
\label{chap:intro}

\section{About revision control}

Revision control is the process of managing multiple versions of a
piece of information.  In its simplest form, this is something that
many people do by hand: every time you modify a file, save it under a
new name that contains a number, each one higher than the number of
the preceding version.

Manually managing multiple versions of even a single file is an
error-prone task, though, so software tools to help automate this
process have long been available.  The earliest automated revision
control tools were intended to help a single user to manage revisions
of a single file.  Over the past few decades, the scope of revision
control tools has expanded greatly; they now manage multiple files,
and help multiple people to work together.  The best modern revision
control tools have no problem coping with thousands of people working
together on projects that consist of hundreds of thousands of files.

\subsection{Why use revision control?}

There are a number of reasons why you or your team might want to use
an automated revision control tool for a project.
\begin{itemize}
\item It will track the history and evolution of your project, so you
  don't have to.  For every change, you'll have a log of \emph{who}
  made it; \emph{why} they made it; \emph{when} they made it; and
  \emph{what} the change was.
\item When you're working with other people, revision control software
  makes it easier for you to collaborate.  For example, when people
  more or less simultaneously make potentially incompatible changes,
  the software will help you to identify and resolve those conflicts.
\item It can help you to recover from mistakes.  If you make a change
  that later turns out to be in error, you can revert to an earlier
  version of one or more files.  In fact, a \emph{really} good
  revision control tool will even help you to efficiently figure out
  exactly when a problem was introduced (see
  section~\ref{sec:undo:bisect} for details).
\item It will help you to work simultaneously on, and manage the drift
  between, multiple versions of your project.
\end{itemize}
Most of these reasons are equally valid---at least in theory---whether
you're working on a project by yourself, or with a hundred other
people.

A key question about the practicality of revision control at these two
different scales (``lone hacker'' and ``huge team'') is how its
\emph{benefits} compare to its \emph{costs}.  A revision control tool
that's difficult to understand or use is going to impose a high cost.

A five-hundred-person project is likely to collapse under its own
weight almost immediately without a revision control tool and process.
In this case, the cost of using revision control might hardly seem
worth considering, since \emph{without} it, failure is almost
guaranteed.

On the other hand, a one-person ``quick hack'' might seem like a poor
place to use a revision control tool, because surely the cost of using
one must be close to the overall cost of the project.  Right?

Mercurial uniquely supports \emph{both} of these scales of
development.  You can learn the basics in just a few minutes, and due
to its low overhead, you can apply revision control to the smallest of
projects with ease.  Its simplicity means you won't have a lot of
abstruse concepts or command sequences competing for mental space with
whatever you're \emph{really} trying to do.  At the same time,
Mercurial's high performance and peer-to-peer nature let you scale
painlessly to handle large projects.

No revision control tool can rescue a poorly run project, but a good
choice of tools can make a huge difference to the fluidity with which
you can work on a project.

\subsection{The many names of revision control}

Revision control is a diverse field, so much so that it doesn't
actually have a single name or acronym.  Here are a few of the more
common names and acronyms you'll encounter:
\begin{itemize}
\item Revision control (RCS)
\item Software configuration management (SCM), or configuration management
\item Source code management
\item Source code control, or source control
\item Version control (VCS)
\end{itemize}
Some people claim that these terms actually have different meanings,
but in practice they overlap so much that there's no agreed or even
useful way to tease them apart.

\section{A short history of revision control}

The best known of the old-time revision control tools is SCCS (Source
Code Control System), which Marc Rochkind wrote at Bell Labs, in the
early 1970s.  SCCS operated on individual files, and required every
person working on a project to have access to a shared workspace on a
single system.  Only one person could modify a file at any time;
arbitration for access to files was via locks.  It was common for
people to lock files, and later forget to unlock them, preventing
anyone else from modifying those files without the help of an
administrator.  

Walter Tichy developed a free alternative to SCCS in the early 1980s;
he called his program RCS (Revison Control System).  Like SCCS, RCS
required developers to work in a single shared workspace, and to lock
files to prevent multiple people from modifying them simultaneously.

Later in the 1980s, Dick Grune used RCS as a building block for a set
of shell scripts he initially called cmt, but then renamed to CVS
(Concurrent Versions System).  The big innovation of CVS was that it
let developers work simultaneously and somewhat independently in their
own personal workspaces.  The personal workspaces prevented developers
from stepping on each other's toes all the time, as was common with
SCCS and RCS.  Each developer had a copy of every project file, and
could modify their copies independently.  They had to merge their
edits prior to committing changes to the central repository.

Brian Berliner took Grune's original scripts and rewrote them in~C,
releasing in 1989 the code that has since developed into the modern
version of CVS.  CVS subsequently acquired the ability to operate over
a network connection, giving it a client/server architecture.  CVS's
architecture is centralised; only the server has a copy of the history
of the project.  Client workspaces just contain copies of recent
versions of the project's files, and a little metadata to tell them
where the server is.  CVS has been enormously successful; it is
probably the world's most widely used revision control system.

In the early 1990s, Sun Microsystems developed an early distributed
revision control system, called TeamWare.  A TeamWare workspace
contains a complete copy of the project's history.  TeamWare has no
notion of a central repository.  (CVS relied upon RCS for its history
storage; TeamWare used SCCS.)

As the 1990s progressed, awareness grew of a number of problems with
CVS.  It records simultaneous changes to multiple files individually,
instead of grouping them together as a single logically atomic
operation.  It does not manage its file hierarchy well; it is easy to
make a mess of a repository by renaming files and directories.  Worse,
its source code is difficult to read and maintain, which made the
``pain level'' of fixing these architectural problems prohibitive.

In 2001, Jim Blandy and Karl Fogel, two developers who had worked on
CVS, started a project to replace it with a tool that would have a
better architecture and cleaner code.  The result, Subversion, does
not stray from CVS's centralised client/server model, but it adds
multi-file atomic commits, better namespace management, and a number
of other features that make it a generally better tool than CVS.
Since its initial release, it has rapidly grown in popularity.

More or less simultaneously, Graydon Hoare began working on an
ambitious distributed revision control system that he named Monotone.
While Monotone addresses many of CVS's design flaws and has a
peer-to-peer architecture, it goes beyond earlier (and subsequent)
revision control tools in a number of innovative ways.  It uses
cryptographic hashes as identifiers, and has an integral notion of
``trust'' for code from different sources.

Mercurial began life in 2005.  While a few aspects of its design are
influenced by Monotone, Mercurial focuses on ease of use, high
performance, and scalability to very large projects.

\section{Trends in revision control}

There has been an unmistakable trend in the development and use of
revision control tools over the past four decades, as people have
become familiar with the capabilities of their tools and constrained
by their limitations.

The first generation began by managing single files on individual
computers.  Although these tools represented a huge advance over
ad-hoc manual revision control, their locking model and reliance on a
single computer limited them to small, tightly-knit teams.

The second generation loosened these constraints by moving to
network-centered architectures, and managing entire projects at a
time.  As projects grew larger, they ran into new problems.  With
clients needing to talk to servers very frequently, server scaling
became an issue for large projects.  An unreliable network connection
could prevent remote users from being able to talk to the server at
all.  As open source projects started making read-only access
available anonymously to anyone, people without commit privileges
found that they could not use the tools to interact with a project in
a natural way, as they could not record their changes.

The current generation of revision control tools is peer-to-peer in
nature.  All of these systems have dropped the dependency on a single
central server, and allow people to distribute their revision control
data to where it's actually needed.  Collaboration over the Internet
has moved from constrained by technology to a matter of choice and
consensus.  Modern tools can operate offline indefinitely and
autonomously, with a network connection only needed when syncing
changes with another repository.

\section{A few of the advantages of distributed revision control}

Even though distributed revision control tools have for several years
been as robust and usable as their previous-generation counterparts,
people using older tools have not yet necessarily woken up to their
advantages.  There are a number of ways in which distributed tools
shine relative to centralised ones.

For an individual developer, distributed tools are almost always much
faster than centralised tools.  This is for a simple reason: a
centralised tool needs to talk over the network for many common
operations, because most metadata is stored in a single copy on the
central server.  A distributed tool stores all of its metadata
locally.  All else being equal, talking over the network adds overhead
to a centralised tool.  Don't underestimate the value of a snappy,
responsive tool: you're going to spend a lot of time interacting with
your revision control software.

Distributed tools are indifferent to the vagaries of your server
infrastructure, again because they replicate metadata to so many
locations.  If you use a centralised system and your server catches
fire, you'd better hope that your backup media are reliable, and that
your last backup was recent and actually worked.  With a distributed
tool, you have many backups available on every contributor's computer.

The reliability of your network will affect distributed tools far less
than it will centralised tools.  You can't even use a centralised tool
without a network connection, except for a few highly constrained
commands.  With a distributed tool, if your network connection goes
down while you're working, you may not even notice.  The only thing
you won't be able to do is talk to repositories on other computers,
something that is relatively rare compared with local operations.  If
you have a far-flung team of collaborators, this may be significant.

\subsection{Advantages for open source projects}

If you take a shine to an open source project and decide that you
would like to start hacking on it, and that project uses a distributed
revision control tool, you are at once a peer with the people who
consider themselves the ``core'' of that project.  If they publish
their repositories, you can immediately copy their project history,
start making changes, and record your work, using the same tools in
the same ways as insiders.  By contrast, with a centralised tool, you
must use the software in a ``read only'' mode unless someone grants
you permission to commit changes to their central server.  Until then,
you won't be able to record changes, and your local modifications will
be at risk of corruption any time you try to update your client's view
of the repository.

\subsubsection{The forking non-problem}

It has been suggested that distributed revision control tools pose
some sort of risk to open source projects because they make it easy to
``fork'' the development of a project.  A fork happens when there are
differences in opinion or attitude between groups of developers that
cause them to decide that they can't work together any longer.  Each
side takes a more or less complete copy of the project's source code,
and goes off in its own direction.

Sometimes the camps in a fork decide to reconcile their differences.
With a centralised revision control system, the \emph{technical}
process of reconciliation is painful, and has to be performed largely
by hand.  You have to decide whose revision history is going to
``win'', and graft the other team's changes into the tree somehow.
This usually loses some or all of one side's revision history.

What distributed tools do with respect to forking is they make forking
the \emph{only} way to develop a project.  Every single change that
you make is potentially a fork point.  The great strength of this
approach is that a distributed revision control tool has to be really
good at \emph{merging} forks, because forks are absolutely
fundamental: they happen all the time.  

If every piece of work that everybody does, all the time, is framed in
terms of forking and merging, then what the open source world refers
to as a ``fork'' becomes \emph{purely} a social issue.  If anything,
distributed tools \emph{lower} the likelihood of a fork:
\begin{itemize}
\item They eliminate the social distinction that centralised tools
  impose: that between insiders (people with commit access) and
  outsiders (people without).
\item They make it easier to reconcile after a social fork, because
  all that's involved from the perspective of the revision control
  software is just another merge.
\end{itemize}

Some people resist distributed tools because they want to retain tight
control over their projects, and they believe that centralised tools
give them this control.  However, if you're of this belief, and you
publish your CVS or Subversion repositories publically, there are
plenty of tools available that can pull out your entire project's
history (albeit slowly) and recreate it somewhere that you don't
control.  So while your control in this case is illusory, you are
forgoing the ability to fluidly collaborate with whatever people feel
compelled to mirror and fork your history.

\subsection{Advantages for commercial projects}

Many commercial projects are undertaken by teams that are scattered
across the globe.  Contributors who are far from a central server will
see slower command execution and perhaps less reliability.  Commercial
revision control systems attempt to ameliorate these problems with
remote-site replication add-ons that are typically expensive to buy
and cantankerous to administer.  A distributed system doesn't suffer
from these problems in the first place.  Better yet, you can easily
set up multiple authoritative servers, say one per site, so that
there's no redundant communication between repositories over expensive
long-haul network links.

Centralised revision control systems tend to have relatively low
scalability.  It's not unusual for an expensive centralised system to
fall over under the combined load of just a few dozen concurrent
users.  Once again, the typical response tends to be an expensive and
clunky replication facility.  Since the load on a central server---if
you have one at all---is orders of magnitude lower with a distributed
tool (because all of the data is replicated everywhere), a single
cheap server can handle the needs of a much larger team, and
replication to balance load becomes a simple matter of scripting.

If you have an employee in the field, troubleshooting a problem at a
customer's site, they'll benefit from distributed revision control.
The tool will let them generate custom builds, try different fixes in
isolation from each other, and search efficiently through history for
the sources of bugs and regressions in the customer's environment, all
without needing to connect to your company's network.

\section{Why choose Mercurial?}

Mercurial has a unique set of properties that make it a particularly
good choice as a revision control system.
\begin{itemize}
\item It is easy to learn and use.
\item It is lightweight.
\item It scales excellently.
\item It is easy to customise.
\end{itemize}

If you are at all familiar with revision control systems, you should
be able to get up and running with Mercurial in less than five
minutes.  Even if not, it will take no more than a few minutes
longer.  Mercurial's command and feature sets are generally uniform
and consistent, so you can keep track of a few general rules instead
of a host of exceptions.

On a small project, you can start working with Mercurial in moments.
Creating new changes and branches; transferring changes around
(whether locally or over a network); and history and status operations
are all fast.  Mercurial attempts to stay nimble and largely out of
your way by combining low cognitive overhead with blazingly fast
operations.

The usefulness of Mercurial is not limited to small projects: it is
used by projects with hundreds to thousands of contributors, each
containing tens of thousands of files and hundreds of megabytes of
source code.

If the core functionality of Mercurial is not enough for you, it's
easy to build on.  Mercurial is well suited to scripting tasks, and
its clean internals and implementation in Python make it easy to add
features in the form of extensions.  There are a number of popular and
useful extensions already available, ranging from helping to identify
bugs to improving performance.

\section{Mercurial compared with other tools}

Before you read on, please understand that this section necessarily
reflects my own experiences, interests, and (dare I say it) biases.  I
have used every one of the revision control tools listed below, in
most cases for several years at a time.

\subsection{Subversion}

Subversion is a popular revision control tool, developed to replace
CVS.  It has a centralised client/server architecture.

Subversion and Mercurial have similarly named commands for performing
the same operations, so it is easy for a person who is familiar with
one to learn to use the other.  Both tools are portable to all popular
operating systems.

Subversion lacks a history-aware merge capability, forcing its users
to manually track exactly which revisions have been merged between
branches.  If users fail to do this, or make mistakes, they face the
prospect of manually resolving merges with unnecessary conflicts.

Mercurial has a substantial performance advantage over Subversion on
every revision control operation I have benchmarked.  I have measured
its advantage as ranging from a factor of two to a factor of six when
compared with Subversion~1.4.3's \emph{ra\_local} file store, which is
the fastest access method available).  In more realistic deployments
involving a network-based store, Subversion will be at a substantially
larger disadvantage.  Because many Subversion commands must talk to
the server and Subversion does not have useful replication facilities,
server capacity becomes a bottleneck for modestly large projects.

Additionally, Subversion incurs a substantial storage overhead to
avoid network transactions for a few common operations, such as
finding modified files (\texttt{status}) and displaying modifications
against the current revision (\texttt{diff}).  A Subversion working
copy is, as a result, often approximately the same size as, or larger
than, a Mercurial repository and working directory, even though the
Mercurial repository contains a complete history of the project.

Subversion is currently more widely supported by
revision-control-aware third party tools than is Mercurial, although
this gap is closing.  Like Mercurial, Subversion has an excellent user
manual.

Several tools exist to accurately and completely import revision
history from a Subversion repository into a Mercurial repository,
making the transition from the older tool relatively painless.

\subsection{Git}

Git is a distributed revision control tool that was developed for
managing the Linux kernel source tree.  Like Mercurial, its early
design was somewhat influenced by Monotone.

Git has an overwhelming command set, with version~1.5.0 providing~139
individual commands.  It has a reputation for being difficult to
learn.  It does not have a user manual, only documentation for
individual commands.

In terms of performance, git is extremely fast.  There are several
common cases in which it is faster than Mercurial, at least on Linux.
However, its performance on (and general support for) Windows is, at
the time of writing, far behind that of Mercurial.

While a Mercurial repository needs no maintenance, a Git repository
requires frequent manual ``repacks'' of its metadata.  Without these,
performance degrades, while space usage grows rapidly.  A server that
contains many Git repositories that are not rigorously and frequently
repacked will become heavily disk-bound during backups, and there have
been instances of daily backups taking far longer than~24 hours as a
result.  A freshly packed Git repository is slightly smaller than a
Mercurial repository, but an unpacked repository is several orders of
magnitude larger.

The core of Git is written in C.  Many Git commands are implemented as
shell or Perl scripts, and the quality of these scripts varies widely.
I have encountered a number of instances where scripts charged along
blindly in the presence of errors that should have been fatal.

\subsection{CVS}

CVS is probably the most widely used revision control tool in the
world.  Due to its age and internal untidiness, it has been
essentially unmaintained for many years.

It has a centralised client/server architecture.  It does not group
related file changes into atomic commits, making it easy for people to
``break the build'': one person can successfully commit part of a
change and then be blocked by the need for a merge, causing other
people to see only a portion of the work they intended to do.  This
also affects how you work with project history.  If you want to see
all of the modifications someone made as part of a task, you will need
to manually inspect the descriptions and timestamps of the changes
made to each file involved (if you even know what those files were).

CVS has a muddled notion of tags and branches that I will not attempt
to even describe.  It does not support renaming of files or
directories well, making it easy to corrupt a repository.  It has
almost no internal consistency checking capabilities, so it is usually
not even possible to tell whether or how a repository is corrupt.  I
would not recommend CVS for any project, existing or new.

Mercurial can import CVS revision history.  However, there are a few
caveats that apply; these are true of every other revision control
tool's CVS importer, too.  Due to CVS's lack of atomic changes and
unversioned filesystem hierarchy, it is not possible to reconstruct
CVS history completely accurately; some guesswork is involved, and
renames will usually not show up.  Because a lot of advanced CVS
administration has to be done by hand and is hence error-prone, it's
common for CVS importers to run into multiple problems with corrupted
repositories (completely bogus revision timestamps and files that have
remained locked for over a decade are just two of the less interesting
problems I can recall from personal experience).

\subsection{Commercial tools}

Perforce has a centralised client/server architecture, with no
client-side caching of any data.  Unlike modern revision control
tools, Perforce requires that a user run a command to inform the
server about every file they intend to edit.

The performance of Perforce is quite good for small teams, but it
falls off rapidly as the number of users grows beyond a few dozen.
Modestly large Perforce installations require the deployment of
proxies to cope with the load their users generate.

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "00book"
%%% End: