comparison en/ch04-concepts.xml @ 831:acf9dc5f088d
Add a skeletal preface.
author | Bryan O'Sullivan <bos@serpentine.com> |
---|---|
date | Thu, 07 May 2009 21:07:35 -0700 |
parents | en/ch03-concepts.xml@18131160f7ee |
children |
830:cbdff5945f9d | 831:acf9dc5f088d |
---|---|
1 <!-- vim: set filetype=docbkxml shiftwidth=2 autoindent expandtab tw=77 : --> | |
2 | |
3 <chapter id="chap:concepts"> | |
4 <?dbhtml filename="behind-the-scenes.html"?> | |
5 <title>Behind the scenes</title> | |
6 | |
7 <para id="x_2e8">Unlike many revision control systems, the concepts | |
8 upon which Mercurial is built are simple enough that it's easy to | |
9 understand how the software really works. Knowing these details | |
10 isn't necessary, so it is safe to skip this | |
11 chapter. However, I think you will get more out of the software | |
12 with a <quote>mental model</quote> of what's going on.</para> | |
13 | |
14 <para id="x_2e9">Being able to understand what's going on behind the | |
15 scenes gives me confidence that Mercurial has been carefully | |
16 designed to be both <emphasis>safe</emphasis> and | |
17 <emphasis>efficient</emphasis>. And just as importantly, if it's | |
18 easy for me to retain a good idea of what the software is doing | |
19 when I perform a revision control task, I'm less likely to be | |
20 surprised by its behavior.</para> | |
21 | |
22 <para id="x_2ea">In this chapter, we'll initially cover the core concepts | |
23 behind Mercurial's design, then continue to discuss some of the | |
24 interesting details of its implementation.</para> | |
25 | |
26 <sect1> | |
27 <title>Mercurial's historical record</title> | |
28 | |
29 <sect2> | |
30 <title>Tracking the history of a single file</title> | |
31 | |
32 <para id="x_2eb">When Mercurial tracks modifications to a file, it stores | |
33 the history of that file in a metadata object called a | |
34 <emphasis>filelog</emphasis>. Each entry in the filelog | |
35 contains enough information to reconstruct one revision of the | |
36 file that is being tracked. Filelogs are stored as files in | |
37 the <filename role="special" | |
38 class="directory">.hg/store/data</filename> directory. A | |
39 filelog contains two kinds of information: revision data, and | |
40 an index to help Mercurial to find a revision | |
41 efficiently.</para> | |
42 | |
43 <para id="x_2ec">A file that is large, or has a lot of history, has its | |
44 filelog stored in separate data | |
45 (<quote><literal>.d</literal></quote> suffix) and index | |
46 (<quote><literal>.i</literal></quote> suffix) files. For | |
47 small files without much history, the revision data and index | |
48 are combined in a single <quote><literal>.i</literal></quote> | |
49 file. The correspondence between a file in the working | |
50 directory and the filelog that tracks its history in the | |
51 repository is illustrated in <xref | |
52 linkend="fig:concepts:filelog"/>.</para> | |
53 | |
54 <figure id="fig:concepts:filelog"> | |
55 <title>Relationships between files in working directory and | |
56 filelogs in repository</title> | |
57 <mediaobject> | |
58 <imageobject><imagedata fileref="figs/filelog.png"/></imageobject> | |
59 <textobject><phrase>XXX add text</phrase></textobject> | |
60 </mediaobject> | |
61 </figure> | |
62 | |
63 </sect2> | |
64 <sect2> | |
65 <title>Managing tracked files</title> | |
66 | |
67 <para id="x_2ee">Mercurial uses a structure called a | |
68 <emphasis>manifest</emphasis> to collect together information | |
69 about the files that it tracks. Each entry in the manifest | |
70 contains information about the files present in a single | |
71 changeset. An entry records which files are present in the | |
72 changeset, the revision of each file, and a few other pieces | |
73 of file metadata.</para> | |
74 | |
75 </sect2> | |
76 <sect2> | |
77 <title>Recording changeset information</title> | |
78 | |
79 <para id="x_2ef">The <emphasis>changelog</emphasis> contains information | |
80 about each changeset. Each revision records who committed a | |
81 change, the changeset comment, other pieces of | |
82 changeset-related information, and the revision of the | |
83 manifest to use.</para> | |
84 | |
85 </sect2> | |
86 <sect2> | |
87 <title>Relationships between revisions</title> | |
88 | |
89 <para id="x_2f0">Within a changelog, a manifest, or a filelog, each | |
90 revision stores a pointer to its immediate parent (or to its | |
91 two parents, if it's a merge revision). As I mentioned above, | |
92 there are also relationships between revisions | |
93 <emphasis>across</emphasis> these structures, and they are | |
94 hierarchical in nature.</para> | |
95 | |
96 <para id="x_2f1">For every changeset in a repository, there is exactly one | |
97 revision stored in the changelog. Each revision of the | |
98 changelog contains a pointer to a single revision of the | |
99 manifest. A revision of the manifest stores a pointer to a | |
100 single revision of each filelog tracked when that changeset | |
101 was created. These relationships are illustrated in | |
102 <xref linkend="fig:concepts:metadata"/>.</para> | |
103 | |
104 <figure id="fig:concepts:metadata"> | |
105 <title>Metadata relationships</title> | |
106 <mediaobject> | |
107 <imageobject><imagedata fileref="figs/metadata.png"/></imageobject> | |
108 <textobject><phrase>XXX add text</phrase></textobject> | |
109 </mediaobject> | |
110 </figure> | |
111 | |
112 <para id="x_2f3">As the illustration shows, there is | |
113 <emphasis>not</emphasis> a <quote>one to one</quote> | |
114 relationship between revisions in the changelog, manifest, or | |
115 filelog. If a file that | |
116 Mercurial tracks hasn't changed between two changesets, the | |
117 entry for that file in the two revisions of the manifest will | |
118 point to the same revision of its filelog<footnote> | |
119 <para id="x_725">It is possible (though unusual) for the manifest to | |
120 remain the same between two changesets, in which case the | |
121 changelog entries for those changesets will point to the | |
122 same revision of the manifest.</para> | |
123 </footnote>.</para> | |
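<para>The pointer hierarchy described above can be sketched as a toy model in Python. The dictionaries, file names, users, and revision numbers below are invented for illustration and do not reflect Mercurial's on-disk format; the sketch only shows that a changelog entry names one manifest revision, and a manifest entry names one revision of each filelog.</para>

```python
# Toy model of the changelog -> manifest -> filelog pointer hierarchy.
# Every name and number here is hypothetical, chosen only for illustration.

filelogs = {
    "README": ["hello\n", "hello, world\n"],   # two revisions of README
    "Makefile": ["all:\n"],                    # one revision of Makefile
}

# Each manifest revision maps a file name to a revision index in its filelog.
manifest = [
    {"README": 0, "Makefile": 0},
    {"README": 1, "Makefile": 0},   # Makefile unchanged: same filelog revision
]

# Each changelog revision points at exactly one manifest revision.
changelog = [
    {"user": "alice", "desc": "initial import", "manifest": 0},
    {"user": "bob", "desc": "expand greeting", "manifest": 1},
]


def files_at(changeset):
    """Follow the pointers from a changeset down to the file contents."""
    entry = manifest[changelog[changeset]["manifest"]]
    return {name: filelogs[name][rev] for name, rev in entry.items()}
```

<para>Calling <literal>files_at(1)</literal> walks changelog, then manifest, then filelogs, and both manifest revisions share revision 0 of the unchanged <filename>Makefile</filename>.</para>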
124 | |
125 </sect2> | |
126 </sect1> | |
127 <sect1> | |
128 <title>Safe, efficient storage</title> | |
129 | |
130 <para id="x_2f4">The underpinnings of changelogs, manifests, and filelogs are | |
131 provided by a single structure called the | |
132 <emphasis>revlog</emphasis>.</para> | |
133 | |
134 <sect2> | |
135 <title>Efficient storage</title> | |
136 | |
137 <para id="x_2f5">The revlog provides efficient storage of revisions using a | |
138 <emphasis>delta</emphasis> mechanism. Instead of storing a | |
139 complete copy of a file for each revision, it stores the | |
140 changes needed to transform an older revision into the new | |
141 revision. For many kinds of file data, these deltas are | |
142 typically a fraction of a percent of the size of a full copy | |
143 of a file.</para> | |
144 | |
145 <para id="x_2f6">Some obsolete revision control systems can only work with | |
146 deltas of text files. They must either store binary files as | |
147 complete snapshots or encoded into a text representation, both | |
148 of which are wasteful approaches. Mercurial can efficiently | |
149 handle deltas of files with arbitrary binary contents; it | |
150 doesn't need to treat text as special.</para> | |
151 | |
152 </sect2> | |
153 <sect2 id="sec:concepts:txn"> | |
154 <title>Safe operation</title> | |
155 | |
156 <para id="x_2f7">Mercurial only ever <emphasis>appends</emphasis> data to | |
157 the end of a revlog file. It never modifies a section of a | |
158 file after it has written it. This is both more robust and | |
159 more efficient than schemes that need to modify or rewrite | |
160 data.</para> | |
161 | |
162 <para id="x_2f8">In addition, Mercurial treats every write as part of a | |
163 <emphasis>transaction</emphasis> that can span a number of | |
164 files. A transaction is <emphasis>atomic</emphasis>: either | |
165 the entire transaction succeeds and its effects are all | |
166 visible to readers in one go, or the whole thing is undone. | |
167 This guarantee of atomicity means that if you're running two | |
168 copies of Mercurial, where one is reading data and one is | |
169 writing it, the reader will never see a partially written | |
170 result that might confuse it.</para> | |
171 | |
172 <para id="x_2f9">The fact that Mercurial only appends to files makes it | |
173 easier to provide this transactional guarantee. The easier it | |
174 is to do stuff like this, the more confident you should be | |
175 that it's done correctly.</para> | |
176 | |
177 </sect2> | |
178 <sect2> | |
179 <title>Fast retrieval</title> | |
180 | |
181 <para id="x_2fa">Mercurial cleverly avoids a pitfall common to | |
182 many earlier revision control systems: the problem of | |
183 <emphasis>inefficient retrieval</emphasis>. Most revision | |
184 control systems store the contents of a revision as an | |
185 incremental series of modifications against a | |
186 <quote>snapshot</quote>. (Some base the snapshot on the | |
187 oldest revision, others on the newest.) To reconstruct a | |
188 specific revision, you must first read the snapshot, and then | |
189 every one of the revisions between the snapshot and your | |
190 target revision. The more history that a file accumulates, | |
191 the more revisions you must read, hence the longer it takes to | |
192 reconstruct a particular revision.</para> | |
193 | |
194 <figure id="fig:concepts:snapshot"> | |
195 <title>Snapshot of a revlog, with incremental deltas</title> | |
196 <mediaobject> | |
197 <imageobject><imagedata fileref="figs/snapshot.png"/></imageobject> | |
198 <textobject><phrase>XXX add text</phrase></textobject> | |
199 </mediaobject> | |
200 </figure> | |
201 | |
202 <para id="x_2fc">The innovation that Mercurial applies to this problem is | |
203 simple but effective. Once the cumulative amount of delta | |
204 information stored since the last snapshot exceeds a fixed | |
205 threshold, it stores a new snapshot (compressed, of course), | |
206 instead of another delta. This makes it possible to | |
207 reconstruct <emphasis>any</emphasis> revision of a file | |
208 quickly. This approach works so well that it has since been | |
209 copied by several other revision control systems.</para> | |
210 | |
211 <para id="x_2fd"><xref linkend="fig:concepts:snapshot"/> illustrates | |
212 the idea. In an entry in a revlog's index file, Mercurial | |
213 stores the range of entries from the data file that it must | |
214 read to reconstruct a particular revision.</para> | |
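<para>A minimal sketch of the snapshot-plus-delta idea follows. It is a toy, not Mercurial's revlog: the <literal>THRESHOLD</literal> value, the single-splice delta format, and the rough size estimate are all assumptions made for illustration. What it does show is the mechanism described above: a new snapshot is stored once the deltas accumulated since the last snapshot grow past a limit, and each revision remembers the range of entries needed to rebuild it.</para>

```python
# Toy revlog: each revision is stored as a full snapshot or as a delta.
# THRESHOLD and the splice-style delta are assumptions for illustration only.

THRESHOLD = 64  # hypothetical cap on delta bytes accumulated since a snapshot


def make_delta(old, new):
    """Describe new as one splice of old: (start, end, replacement)."""
    start = 0
    limit = min(len(old), len(new))
    while limit > start and old[start] == new[start]:
        start += 1
    tail = 0
    while limit - start > tail and old[-1 - tail] == new[-1 - tail]:
        tail += 1
    return start, len(old) - tail, new[start:len(new) - tail]


def apply_delta(old, delta):
    start, end, replacement = delta
    return old[:start] + replacement + old[end:]


class ToyRevlog:
    def __init__(self):
        self.entries = []   # ("snapshot", bytes) or ("delta", tuple)
        self.base = []      # index of the snapshot each revision builds on
        self.pending = 0    # delta bytes stored since the last snapshot

    def add(self, text):
        if self.entries:
            delta = make_delta(self.read(len(self.entries) - 1), text)
            cost = len(delta[2]) + 8   # rough size: payload plus two offsets
            if THRESHOLD >= self.pending + cost:
                self.entries.append(("delta", delta))
                self.base.append(self.base[-1])
                self.pending += cost
                return
        # First revision, or too much delta data piled up: store a snapshot.
        self.entries.append(("snapshot", text))
        self.base.append(len(self.entries) - 1)
        self.pending = 0

    def read(self, rev):
        # Only entries base[rev]..rev are read, however long the history is.
        text = self.entries[self.base[rev]][1]
        for _, delta in self.entries[self.base[rev] + 1:rev + 1]:
            text = apply_delta(text, delta)
        return text
```

<para>Because <literal>read</literal> starts at <literal>base[rev]</literal> rather than at revision zero, the work needed to rebuild a revision is bounded by the threshold instead of by the length of the file's history.</para>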
215 | |
216 <sect3> | |
217 <title>Aside: the influence of video compression</title> | |
218 | |
219 <para id="x_2fe">If you're familiar with video compression or | |
220 have ever watched a TV feed through a digital cable or | |
221 satellite service, you may know that most video compression | |
222 schemes store each frame of video as a delta against its | |
223 predecessor frame.</para> | |
224 | |
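225 <para id="x_2ff">Encoders also periodically insert a complete frame (a <quote>key frame</quote>), so that decoding never depends on a long chain of deltas. Mercurial borrows this idea to make it | |
226 possible to reconstruct a revision from a snapshot and a | |
227 small number of deltas.</para> | |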
228 | |
229 </sect3> | |
230 </sect2> | |
231 <sect2> | |
232 <title>Identification and strong integrity</title> | |
233 | |
234 <para id="x_300">Along with delta or snapshot information, a revlog entry | |
235 contains a cryptographic hash of the data that it represents. | |
236 This makes it difficult to forge the contents of a revision, | |
237 and easy to detect accidental corruption.</para> | |
238 | |
239 <para id="x_301">Hashes provide more than a mere check against corruption; | |
240 they are used as the identifiers for revisions. The changeset | |
241 identification hashes that you see as an end user are from | |
242 revisions of the changelog. Although filelogs and the | |
243 manifest also use hashes, Mercurial only uses these behind the | |
244 scenes.</para> | |
245 | |
246 <para id="x_302">Mercurial verifies that hashes are correct when it | |
247 retrieves file revisions and when it pulls changes from | |
248 another repository. If it encounters an integrity problem, it | |
249 will complain and stop whatever it's doing.</para> | |
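<para>As a hedged sketch of how a hash can serve as both identifier and integrity check, the following Python computes an ID from a revision's text and its parent IDs, then verifies it on retrieval. The choice of SHA-1, the decision to hash the parent IDs together with the text, and the names <literal>node_id</literal>, <literal>verify</literal>, and <literal>NO_PARENT</literal> are assumptions for illustration, not a statement of Mercurial's exact recipe.</para>

```python
import hashlib

# Placeholder meaning "no parent": twenty zero bytes (an assumption here).
NO_PARENT = b"\0" * 20


def node_id(text, p1=NO_PARENT, p2=NO_PARENT):
    """Hash the revision text together with its sorted parent IDs."""
    first, second = sorted([p1, p2])
    digest = hashlib.sha1(first + second)
    digest.update(text)
    return digest.digest()


def verify(text, p1, p2, expected):
    """Return text if it matches the recorded ID, otherwise raise."""
    if node_id(text, p1, p2) != expected:
        raise ValueError("integrity check failed")
    return text
```

<para>Changing even one byte of the text yields a different ID, so a reader that recomputes the hash can detect corruption and stop.</para>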
250 | |
251 <para id="x_303">In addition to the effect it has on retrieval efficiency, | |
252 Mercurial's use of periodic snapshots makes it more robust | |
253 against partial data corruption. If a revlog becomes partly | |
254 corrupted due to a hardware error or system bug, it's often | |
255 possible to reconstruct some or most revisions from the | |
256 uncorrupted sections of the revlog, both before and after the | |
257 corrupted section. This would not be possible with a | |
258 delta-only storage model.</para> | |
259 </sect2> | |
260 </sect1> | |
261 | |
262 <sect1> | |
263 <title>Revision history, branching, and merging</title> | |
264 | |
265 <para id="x_304">Every entry in a Mercurial revlog knows the identity of its | |
266 immediate ancestor revision, usually referred to as its | |
267 <emphasis>parent</emphasis>. In fact, a revision contains room | |
268 for not one parent, but two. Mercurial uses a special hash, | |
269 called the <quote>null ID</quote>, to represent the idea | |
270 <quote>there is no parent here</quote>. This hash is simply a | |
271 string of zeroes.</para> | |
272 | |
273 <para id="x_305">In <xref linkend="fig:concepts:revlog"/>, you can see | |
274 an example of the conceptual structure of a revlog. Filelogs, | |
275 manifests, and changelogs all have this same structure; they | |
276 differ only in the kind of data stored in each delta or | |
277 snapshot.</para> | |
278 | |
279 <para id="x_306">The first revision in a revlog (at the bottom of the image) | |
280 has the null ID in both of its parent slots. For a | |
281 <quote>normal</quote> revision, its first parent slot contains | |
282 the ID of its parent revision, and its second contains the null | |
283 ID, indicating that the revision has only one real parent. Any | |
284 two revisions that have the same parent ID are branches. A | |
285 revision that represents a merge between branches has two normal | |
286 revision IDs in its parent slots.</para> | |
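<para>The parent-slot rules above can be illustrated with a small Python sketch. The revision names <literal>a</literal> through <literal>d</literal> and the helper functions are hypothetical; real revision IDs are hashes. The null ID is modelled as a string of forty zeroes.</para>

```python
from collections import defaultdict

NULL_ID = "0" * 40   # "there is no parent here"

# Hypothetical history: a is the first revision, b and c both descend
# from a, and d merges b with c.
revisions = {
    "a": (NULL_ID, NULL_ID),
    "b": ("a", NULL_ID),
    "c": ("a", NULL_ID),
    "d": ("b", "c"),
}


def classify(rev_id):
    """Name the kind of revision from what sits in its two parent slots."""
    p1, p2 = revisions[rev_id]
    if p1 == NULL_ID and p2 == NULL_ID:
        return "root"
    return "normal" if p2 == NULL_ID else "merge"


def branch_points():
    """Map each parent to its children when it has more than one child."""
    children = defaultdict(list)
    for rev_id, parents in revisions.items():
        for parent in parents:
            if parent != NULL_ID:
                children[parent].append(rev_id)
    return {p: kids for p, kids in children.items() if len(kids) > 1}
```

<para>Here <literal>b</literal> and <literal>c</literal> share the parent <literal>a</literal>, so they are branches, and <literal>d</literal> has two real parents, so it is a merge.</para>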
287 | |
288 <figure id="fig:concepts:revlog"> | |
289 <title>The conceptual structure of a revlog</title> | |
290 <mediaobject> | |
291 <imageobject><imagedata fileref="figs/revlog.png"/></imageobject> | |
292 <textobject><phrase>XXX add text</phrase></textobject> | |
293 </mediaobject> | |
294 </figure> | |
295 | |
296 </sect1> | |
297 <sect1> | |
298 <title>The working directory</title> | |
299 | |
300 <para id="x_307">In the working directory, Mercurial stores a snapshot of the | |
301 files from the repository as of a particular changeset.</para> | |
302 | |
303 <para id="x_308">The working directory <quote>knows</quote> which changeset | |
304 it contains. When you update the working directory to contain a | |
305 particular changeset, Mercurial looks up the appropriate | |
306 revision of the manifest to find out which files it was tracking | |
307 at the time that changeset was committed, and which revision of | |
308 each file was then current. It then recreates a copy of each of | |
309 those files, with the same contents it had when the changeset | |
310 was committed.</para> | |
311 | |
312 <para id="x_309">The <emphasis>dirstate</emphasis> is a special | |
313 structure that contains Mercurial's knowledge of the working | |
314 directory. It is maintained as a file named | |
315 <filename>.hg/dirstate</filename> inside a repository. The | |
316 dirstate details which changeset the working directory is | |
317 updated to, and all of the files that Mercurial is tracking in | |
318 the working directory. It also lets Mercurial quickly notice | |
319 changed files, by recording their checkout times and | |
320 sizes.</para> | |
321 | |
322 <para id="x_30a">Just as a revision of a revlog has room for two parents, so | |
323 that it can represent either a normal revision (with one parent) | |
324 or a merge of two earlier revisions, the dirstate has slots for | |
325 two parents. When you use the <command role="hg-cmd">hg | |
326 update</command> command, the changeset that you update to is | |
327 stored in the <quote>first parent</quote> slot, and the null ID | |
328 in the second. When you <command role="hg-cmd">hg | |
329 merge</command> with another changeset, the first parent | |
330 remains unchanged, and the second parent is filled in with the | |
331 changeset you're merging with. The <command role="hg-cmd">hg | |
332 parents</command> command tells you what the parents of the | |
333 dirstate are.</para> | |
334 | |
335 <sect2> | |
336 <title>What happens when you commit</title> | |
337 | |
338 <para id="x_30b">The dirstate stores parent information for more than just | |
339 book-keeping purposes. Mercurial uses the parents of the | |
340 dirstate as <emphasis>the parents of a new | |
341 changeset</emphasis> when you perform a commit.</para> | |
342 | |
343 <figure id="fig:concepts:wdir"> | |
344 <title>The working directory can have two parents</title> | |
345 <mediaobject> | |
346 <imageobject><imagedata fileref="figs/wdir.png"/></imageobject> | |
347 <textobject><phrase>XXX add text</phrase></textobject> | |
348 </mediaobject> | |
349 </figure> | |
350 | |
351 <para id="x_30d"><xref linkend="fig:concepts:wdir"/> shows the | |
352 normal state of the working directory, where it has a single | |
353 changeset as parent. That changeset is the | |
354 <emphasis>tip</emphasis>, the newest changeset in the | |
355 repository that has no children.</para> | |
356 | |
357 <figure id="fig:concepts:wdir-after-commit"> | |
358 <title>The working directory gains new parents after a | |
359 commit</title> | |
360 <mediaobject> | |
361 <imageobject><imagedata fileref="figs/wdir-after-commit.png"/></imageobject> | |
362 <textobject><phrase>XXX add text</phrase></textobject> | |
363 </mediaobject> | |
364 </figure> | |
365 | |
366 <para id="x_30f">It's useful to think of the working directory as | |
367 <quote>the changeset I'm about to commit</quote>. Any files | |
368 that you tell Mercurial that you've added, removed, renamed, | |
369 or copied will be reflected in that changeset, as will | |
370 modifications to any files that Mercurial is already tracking; | |
371 the new changeset will have the parents of the working | |
372 directory as its parents.</para> | |
373 | |
374 <para id="x_310">After a commit, Mercurial will update the | |
375 parents of the working directory, so that the first parent is | |
376 the ID of the new changeset, and the second is the null ID. | |
377 This is shown in <xref | |
378 linkend="fig:concepts:wdir-after-commit"/>. Mercurial | |
379 doesn't touch any of the files in the working directory when | |
380 you commit; it just modifies the dirstate to note its new | |
381 parents.</para> | |
382 | |
383 </sect2> | |
384 <sect2> | |
385 <title>Creating a new head</title> | |
386 | |
387 <para id="x_311">It's perfectly normal to update the working directory to a | |
388 changeset other than the current tip. For example, you might | |
389 want to know what your project looked like last Tuesday, or | |
390 you could be looking through changesets to see which one | |
391 introduced a bug. In cases like this, the natural thing to do | |
392 is update the working directory to the changeset you're | |
393 interested in, and then examine the files in the working | |
394 directory directly to see their contents as they were when you | |
395 committed that changeset. The effect of this is shown in | |
396 <xref linkend="fig:concepts:wdir-pre-branch"/>.</para> | |
397 | |
398 <figure id="fig:concepts:wdir-pre-branch"> | |
399 <title>The working directory, updated to an older | |
400 changeset</title> | |
401 <mediaobject> | |
402 <imageobject><imagedata fileref="figs/wdir-pre-branch.png"/></imageobject> | |
403 <textobject><phrase>XXX add text</phrase></textobject> | |
404 </mediaobject> | |
405 </figure> | |
406 | |
407 <para id="x_313">Having updated the working directory to an | |
408 older changeset, what happens if you make some changes, and | |
409 then commit? Mercurial behaves in the same way as I outlined | |
410 above. The parents of the working directory become the | |
411 parents of the new changeset. This new changeset has no | |
412 children, so it becomes the new tip. And the repository now | |
413 contains two changesets that have no children; we call these | |
414 <emphasis>heads</emphasis>. You can see the structure that | |
415 this creates in <xref | |
416 linkend="fig:concepts:wdir-branch"/>.</para> | |
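<para>The way the dirstate parents drive commits, and how a second head appears, can be modelled in a few lines of Python. <literal>ToyRepo</literal> and its methods are hypothetical stand-ins for <command role="hg-cmd">hg update</command>, <command role="hg-cmd">hg merge</command>, and <command role="hg-cmd">hg commit</command>; they track only parent pointers and ignore file contents entirely.</para>

```python
NULL = None   # stands in for the null ID


class ToyRepo:
    def __init__(self):
        self.changesets = {}           # id -> (first parent, second parent)
        self.parents = (NULL, NULL)    # the dirstate's two parent slots
        self.next_id = 0

    def update(self, cset):
        # Updating fills the first slot and clears the second.
        self.parents = (cset, NULL)

    def merge(self, other):
        # Merging keeps the first parent and fills in the second.
        self.parents = (self.parents[0], other)

    def commit(self):
        cset = self.next_id
        self.next_id += 1
        # The dirstate parents become the parents of the new changeset.
        self.changesets[cset] = self.parents
        self.parents = (cset, NULL)
        return cset

    def heads(self):
        """Changesets that no other changeset names as a parent."""
        used = {p for pair in self.changesets.values() for p in pair}
        return sorted(c for c in self.changesets if c not in used)
```

<para>Committing twice, updating back to the first changeset, and committing again leaves two heads in this model; merging the other head and committing once more brings it back to one.</para>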
417 | |
418 <figure id="fig:concepts:wdir-branch"> | |
419 <title>After a commit made while synced to an older | |
420 changeset</title> | |
421 <mediaobject> | |
422 <imageobject><imagedata fileref="figs/wdir-branch.png"/></imageobject> | |
423 <textobject><phrase>XXX add text</phrase></textobject> | |
424 </mediaobject> | |
425 </figure> | |
426 | |
427 <note> | |
428 <para id="x_315">If you're new to Mercurial, you should keep | |
429 in mind a common <quote>error</quote>, which is to use the | |
430 <command role="hg-cmd">hg pull</command> command without any | |
431 options. By default, the <command role="hg-cmd">hg | |
432 pull</command> command <emphasis>does not</emphasis> | |
433 update the working directory, so you'll bring new changesets | |
434 into your repository, but the working directory will stay | |
435 synced at the same changeset as before the pull. If you | |
436 make some changes and commit afterwards, you'll thus create | |
437 a new head, because your working directory isn't synced to | |
438 whatever the current tip is. To combine the operation of a | |
439 pull, followed by an update, run <command>hg pull | |
440 -u</command>.</para> | |
441 | |
442 <para id="x_316">I put the word <quote>error</quote> in quotes | |
443 because all that you need to do to rectify the situation | |
444 where you created a new head by accident is | |
445 <command role="hg-cmd">hg merge</command>, then <command | |
446 role="hg-cmd">hg commit</command>. In other words, this | |
447 almost never has negative consequences; it's just something | |
448 of a surprise for newcomers. I'll discuss other ways to | |
449 avoid this behavior, and why Mercurial behaves in this | |
450 initially surprising way, later on.</para> | |
451 </note> | |
452 | |
453 </sect2> | |
454 <sect2> | |
455 <title>Merging changes</title> | |
456 | |
457 <para id="x_317">When you run the <command role="hg-cmd">hg | |
458 merge</command> command, Mercurial leaves the first parent | |
459 of the working directory unchanged, and sets the second parent | |
460 to the changeset you're merging with, as shown in <xref | |
461 linkend="fig:concepts:wdir-merge"/>.</para> | |
462 | |
463 <figure id="fig:concepts:wdir-merge"> | |
464 <title>Merging two heads</title> | |
465 <mediaobject> | |
466 <imageobject> | |
467 <imagedata fileref="figs/wdir-merge.png"/> | |
468 </imageobject> | |
469 <textobject><phrase>XXX add text</phrase></textobject> | |
470 </mediaobject> | |
471 </figure> | |
472 | |
473 <para id="x_319">Mercurial also has to modify the working directory, to | |
474 merge the files managed in the two changesets. Simplified a | |
475 little, the merging process goes like this, for every file in | |
476 the manifests of both changesets.</para> | |
477 <itemizedlist> | |
478 <listitem><para id="x_31a">If neither changeset has modified a file, do | |
479 nothing with that file.</para> | |
480 </listitem> | |
481 <listitem><para id="x_31b">If one changeset has modified a file, and the | |
482 other hasn't, create the modified copy of the file in the | |
483 working directory.</para> | |
484 </listitem> | |
485 <listitem><para id="x_31c">If one changeset has removed a file, and the | |
486 other hasn't (or has also deleted it), delete the file | |
487 from the working directory.</para> | |
488 </listitem> | |
489 <listitem><para id="x_31d">If one changeset has removed a file, but the | |
490 other has modified the file, ask the user what to do: keep | |
491 the modified file, or remove it?</para> | |
492 </listitem> | |
493 <listitem><para id="x_31e">If both changesets have modified a file, | |
494 invoke an external merge program to choose the new | |
495 contents for the merged file. This may require input from | |
496 the user.</para> | |
497 </listitem> | |
498 <listitem><para id="x_31f">If one changeset has modified a file, and the | |
499 other has renamed or copied the file, make sure that the | |
500 changes follow the new name of the file.</para> | |
501 </listitem></itemizedlist> | |
502 <para id="x_320">There are more details&emdash;merging has plenty of corner | |
503 cases&emdash;but these are the most common choices that are | |
504 involved in a merge. As you can see, most cases are | |
505 completely automatic, and indeed most merges finish | |
506 automatically, without requiring your input to resolve any | |
507 conflicts.</para> | |
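<para>The per-file choices listed above can be sketched as a small decision function. This is a simplification under stated assumptions: each side's state is reduced to one of three hypothetical labels relative to the common ancestor, the returned strings are invented, and the rename and copy case is left out.</para>

```python
VALID = {"unchanged", "modified", "removed"}


def merge_action(ours, theirs):
    """Pick an action for one file, given what each changeset did to it."""
    states = {ours, theirs}
    if states - VALID:
        raise ValueError("unknown state: %r" % sorted(states - VALID))
    if states == {"unchanged"}:
        return "do nothing"
    if states == {"modified"}:
        return "run merge program"
    if states == {"modified", "removed"}:
        return "ask user"
    if "removed" in states:
        # Removed on one side and untouched on the other, or removed on both.
        return "delete file"
    # One side modified the file and the other left it alone.
    return "take modified copy"
```

<para>Only two of the outcomes can need a person: the modified-versus-removed question and a merge program that cannot resolve overlapping edits on its own.</para>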
508 | |
509 <para id="x_321">When you're thinking about what happens when you commit | |
510 after a merge, once again the working directory is <quote>the | |
511 changeset I'm about to commit</quote>. After the <command | |
512 role="hg-cmd">hg merge</command> command completes, the | |
513 working directory has two parents; these will become the | |
514 parents of the new changeset.</para> | |
515 | |
516 <para id="x_322">Mercurial lets you perform multiple merges, but | |
517 you must commit the results of each individual merge as you | |
518 go. This is necessary because Mercurial tracks at most two | |
519 parents, both for revisions and for the working directory. While | |
520 it would be technically feasible to merge multiple changesets | |
521 at once, Mercurial avoids this for simplicity. With multi-way | |
522 merges, the risks of user confusion, nasty conflict | |
523 resolution, and making a terrible mess of a merge would grow | |
524 intolerable.</para> | |
525 | |
526 </sect2> | |
527 | |
528 <sect2> | |
529 <title>Merging and renames</title> | |
530 | |
531 <para id="x_69a">A surprising number of revision control systems pay little | |
532 or no attention to a file's <emphasis>name</emphasis> over | |
533 time. For instance, it used to be common that if a file got | |
534 renamed on one side of a merge, the changes from the other | |
535 side would be silently dropped.</para> | |
536 | |
537 <para id="x_69b">Mercurial records metadata when you tell it to perform a | |
538 rename or copy. It uses this metadata during a merge to do the | |
539 right thing. For instance, if I rename | |
540 a file, and you edit it without renaming it, when we merge our | |
541 work the file will be renamed and have your edits | |
542 applied.</para> | |
543 </sect2> | |
544 </sect1> | |
545 | |
546 <sect1> | |
547 <title>Other interesting design features</title> | |
548 | |
549 <para id="x_323">In the sections above, I've tried to highlight some of the | |
550 most important aspects of Mercurial's design, to illustrate that | |
551 it pays careful attention to reliability and performance. | |
552 However, the attention to detail doesn't stop there. There are | |
553 a number of other aspects of Mercurial's construction that I | |
554 personally find interesting. I'll detail a few of them here, | |
555 separate from the <quote>big ticket</quote> items above, so that | |
556 if you're interested, you can gain a better idea of the amount | |
557 of thinking that goes into a well-designed system.</para> | |
558 | |
559 <sect2> | |
560 <title>Clever compression</title> | |
561 | |
562 <para id="x_324">When appropriate, Mercurial will store both snapshots and | |
563 deltas in compressed form. It does this by always | |
564 <emphasis>trying to</emphasis> compress a snapshot or delta, | |
565 but only storing the compressed version if it's smaller than | |
566 the uncompressed version.</para> | |
567 | |
568 <para id="x_325">This means that Mercurial does <quote>the right | |
569 thing</quote> when storing a file whose native form is | |
570 compressed, such as a <literal>zip</literal> archive or a JPEG | |
571 image. When these types of files are compressed a second | |
572 time, the resulting file is usually bigger than the | |
573 once-compressed form, and so Mercurial will store the plain | |
574 <literal>zip</literal> or JPEG.</para> | |
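<para>The <quote>try to compress, keep whichever is smaller</quote> rule is easy to sketch with Python's standard <literal>zlib</literal> module, which implements deflate. The <literal>pack</literal> and <literal>unpack</literal> names and the tagged-tuple representation are assumptions for illustration; they say nothing about how Mercurial marks a chunk on disk.</para>

```python
import zlib


def pack(data):
    """Compress data, but keep the original if compression does not help."""
    compressed = zlib.compress(data)
    if len(data) > len(compressed):
        return ("deflate", compressed)
    return ("plain", data)


def unpack(chunk):
    """Undo pack(), whichever form was stored."""
    kind, payload = chunk
    return zlib.decompress(payload) if kind == "deflate" else payload
```

<para>Repetitive text is stored deflated, while input that deflate cannot shrink, such as very short or already-compressed data, is stored as is.</para>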
575 | |
576 <para id="x_326">Deltas between revisions of a compressed file are usually | |
577 larger than snapshots of the file, and Mercurial again does | |
578 <quote>the right thing</quote> in these cases. It finds that | |
579 such a delta exceeds the threshold at which it should store a | |
580 complete snapshot of the file, so it stores the snapshot, | |
581 again saving space compared to a naive delta-only | |
582 approach.</para> | |
583 | |
584 <sect3> | |
585 <title>Network recompression</title> | |
586 | |
587 <para id="x_327">When storing revisions on disk, Mercurial uses the | |
588 <quote>deflate</quote> compression algorithm (the same one | |
589 used by the popular <literal>zip</literal> archive format), | |
590 which balances good speed with a respectable compression | |
591 ratio. However, when transmitting revision data over a | |
592 network connection, Mercurial uncompresses the compressed | |
593 revision data.</para> | |
594 | |
595 <para id="x_328">If the connection is over HTTP, Mercurial recompresses | |
596 the entire stream of data using a compression algorithm that | |
597 gives a better compression ratio (the Burrows-Wheeler | |
598 algorithm from the widely used <literal>bzip2</literal> | |
599 compression package). This combination of algorithm and | |
600 compression of the entire stream (instead of a revision at a | |
601 time) substantially reduces the number of bytes to be | |
602 transferred, yielding better network performance over most | |
603 kinds of network.</para> | |
604 | |
605 <para id="x_329">If the connection is over | |
606 <command>ssh</command>, Mercurial | |
607 <emphasis>doesn't</emphasis> recompress the stream, because | |
608 <command>ssh</command> can already do this itself. You can | |
609 tell Mercurial to always use <command>ssh</command>'s | |
610 compression feature by editing the | |
611 <filename>.hgrc</filename> file in your home directory as | |
612 follows.</para> | |
613 | |
614 <programlisting>[ui] | |
615 ssh = ssh -C</programlisting> | |
616 | |
617 </sect3> | |
618 </sect2> | |
619 <sect2> | |
620 <title>Read/write ordering and atomicity</title> | |
621 | |
622 <para id="x_32a">Appending to files isn't the whole story when | |
623 it comes to guaranteeing that a reader won't see a partial | |
624 write. If you recall <xref linkend="fig:concepts:metadata"/>, | |
625 revisions in the changelog point to revisions in the manifest, | |
626 and revisions in the manifest point to revisions in filelogs. | |
627 This hierarchy is deliberate.</para> | |
628 | |
629 <para id="x_32b">A writer starts a transaction by writing filelog and | |
630 manifest data, and doesn't write any changelog data until | |
631 those are finished. A reader starts by reading changelog | |
632 data, then manifest data, followed by filelog data.</para> | |
633 | |
634 <para id="x_32c">Since the writer has always finished writing filelog and | |
635 manifest data before it writes to the changelog, a reader will | |
636 never read a pointer to a partially written manifest revision | |
637 from the changelog, and it will never read a pointer to a | |
638 partially written filelog revision from the manifest.</para> | |
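<para>The ordering argument can be made concrete with a toy append-only store. <literal>ToyStore</literal> is hypothetical and holds a single file; the generator pauses after each step of a commit so that a reader can be run in the middle of a write.</para>

```python
class ToyStore:
    def __init__(self):
        self.filelog = []     # file contents, append-only
        self.manifest = []    # each entry: index into filelog
        self.changelog = []   # each entry: (description, index into manifest)

    def commit_steps(self, content, desc):
        """Write bottom-up, yielding after each step so readers can interleave."""
        self.filelog.append(content)
        yield "filelog"
        self.manifest.append(len(self.filelog) - 1)
        yield "manifest"
        # The changelog entry goes last; only now is the commit visible.
        self.changelog.append((desc, len(self.manifest) - 1))
        yield "changelog"

    def read_tip(self):
        """Read top-down: changelog, then manifest, then filelog."""
        desc, manifest_rev = self.changelog[-1]
        return desc, self.filelog[self.manifest[manifest_rev]]
```

<para>A reader that runs between the <literal>filelog</literal> and <literal>changelog</literal> steps still sees the previous commit in full, because nothing it can reach from the changelog refers to the half-written data.</para>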
639 | |
640 </sect2> | |
641 <sect2> | |
642 <title>Concurrent access</title> | |
643 | |
644 <para id="x_32d">The read/write ordering and atomicity guarantees mean that | |
645 Mercurial never needs to <emphasis>lock</emphasis> a | |
646 repository when it's reading data, even if the repository is | |
647 being written to while the read is occurring. This has a big | |
648 effect on scalability; you can have an arbitrary number of | |
649 Mercurial processes safely reading data from a repository | |
650 all at once, no matter whether it's being written to or | |
651 not.</para> | |
652 | |
653 <para id="x_32e">The lockless nature of reading means that if you're | |
654 sharing a repository on a multi-user system, you don't need to | |
655 grant other local users permission to | |
656 <emphasis>write</emphasis> to your repository in order for | |
657 them to be able to clone it or pull changes from it; they only | |
658 need <emphasis>read</emphasis> permission. (This is | |
659 <emphasis>not</emphasis> a common feature among revision | |
660 control systems, so don't take it for granted! Most require | |
661 readers to be able to lock a repository to access it safely, | |
662 and this requires write permission on at least one directory, | |
663 which of course makes for all kinds of nasty and annoying | |
664 security and administrative problems.)</para> | |
665 | |
666 <para id="x_32f">Mercurial uses locks to ensure that only one process can | |
667 write to a repository at a time (the locking mechanism is safe | |
668 even over filesystems that are notoriously hostile to locking, | |
669 such as NFS). If a repository is locked, a writer will wait | |
670 a while and retry in case the repository becomes unlocked, but | |
671 if the repository remains locked for too long, the process | |
672 attempting to write will eventually time out. This means | |
673 that your daily automated scripts won't get stuck forever and | |
674 pile up if a system crashes unnoticed, for example. (Yes, the | |
675 timeout is configurable, from zero to infinity.)</para> | |
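
<para>The retry-then-give-up behaviour of a writer can be sketched as
follows. This is not Mercurial's locking code: the names
<literal>acquire</literal>, <literal>release</literal> and
<literal>LockTimeout</literal> are hypothetical, the lock is a plain
file created exclusively, and the sketch makes no attempt to be safe
over NFS as the real mechanism is.</para>

```python
import os
import tempfile
import time


class LockTimeout(Exception):
    """Raised when the lock stays held for longer than the timeout."""


def acquire(lock_path, timeout, poll_interval=0.01):
    """Create lock_path exclusively, retrying until timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the lock file already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise LockTimeout(lock_path)
            time.sleep(poll_interval)


def release(lock_path):
    """Drop the lock by removing the lock file."""
    os.unlink(lock_path)


repo = tempfile.mkdtemp()
lock_path = os.path.join(repo, "lock")

acquire(lock_path, timeout=1.0)          # first writer gets the lock
try:
    acquire(lock_path, timeout=0.05)     # second writer retries, then gives up
    second_writer = "acquired"
except LockTimeout:
    second_writer = "timed out"

release(lock_path)
acquire(lock_path, timeout=0.05)         # lock is free again
release(lock_path)
```

<para>With a timeout of zero this sketch tries exactly once; a very
large timeout approximates waiting forever.</para>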
676 | |
677 <sect3> | |
678 <title>Safe dirstate access</title> | |
679 | |
680 <para id="x_330">As with revision data, Mercurial doesn't take a lock to | |
681 read the dirstate file; it does acquire a lock to write it. | |
682 To avoid the possibility of reading a partially written copy | |
683 of the dirstate file, Mercurial writes to a file with a | |
684 unique name in the same directory as the dirstate file, then | |
685 renames the temporary file atomically to | |
686 <filename>dirstate</filename>. The file named | |
687 <filename>dirstate</filename> is thus guaranteed to be | |
688 complete, not partially written.</para> | |
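
<para>A minimal sketch of the write-then-rename technique follows. The
helper name <literal>write_atomically</literal> and the
<literal>.dirstate-</literal> temporary file prefix are assumptions made
for illustration, not Mercurial's actual code or naming.</para>

```python
import os
import tempfile


def write_atomically(path, data):
    """Write data to a uniquely named temp file, then rename it over path."""
    directory = os.path.dirname(path) or "."
    # Create the temp file next to the target so the rename stays on one
    # filesystem, where it can replace the old file in a single step.
    fd, tmp_path = tempfile.mkstemp(prefix=".dirstate-", dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


workdir = tempfile.mkdtemp()
dirstate = os.path.join(workdir, "dirstate")
write_atomically(dirstate, b"old state")
write_atomically(dirstate, b"new state")

with open(dirstate, "rb") as f:
    contents = f.read()
leftovers = sorted(os.listdir(workdir))
```

<para>A reader that opens <filename>dirstate</filename> at any moment
sees either the old contents or the new ones, never a mixture, because
the name only ever refers to a fully written file.</para>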
689 | |
690 </sect3> | |
691 </sect2> | |
692 <sect2> | |
693 <title>Avoiding seeks</title> | |
694 | |
695 <para id="x_331">Critical to Mercurial's performance is the avoidance of | |
696 seeks of the disk head, since any seek is far more expensive | |
697 than even a comparatively large read operation.</para> | |
698 | |
699 <para id="x_332">This is why, for example, the dirstate is stored in a | |
700 single file. If there were a dirstate file per directory that | |
701 Mercurial tracked, the disk would seek once per directory. | |
702 Instead, Mercurial reads the entire single dirstate file in | |
703 one step.</para> | |
704 | |
705 <para id="x_333">Mercurial also uses a <quote>copy on write</quote> scheme | |
706 when cloning a repository on local storage. Instead of | |
707 copying every revlog file from the old repository into the new | |
708 repository, it makes a <quote>hard link</quote>, which is a | |
709 shorthand way to say <quote>these two names point to the same | |
710 file</quote>. When Mercurial is about to write to one of a | |
711 revlog's files, it checks to see if the number of names | |
712 pointing at the file is greater than one. If it is, more than | |
713 one repository is using the file, so Mercurial makes a new | |
714 copy of the file that is private to this repository.</para> | |
715 | |
716 <para id="x_334">A few revision control developers have pointed out that | |
717 this idea of making a complete private copy of a file is not | |
718 very efficient in its use of storage. While this is true, | |
719 storage is cheap, and this method gives the highest | |
720 performance while deferring most book-keeping to the operating | |
721 system. An alternative scheme would most likely reduce | |
722 performance and increase the complexity of the software, and | |
723 both speed and simplicity are key to the <quote>feel</quote> of | |
724 day-to-day use.</para> | |
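
<para>The hard-link scheme described above can be sketched with the
operating system's link count. The function names
<literal>clone_file</literal> and <literal>open_for_append</literal>,
and the <literal>.copy</literal> suffix, are hypothetical; the sketch
assumes a filesystem that supports hard links and is not Mercurial's
own implementation.</para>

```python
import os
import shutil
import tempfile


def clone_file(src, dst):
    """'Clone' a file by giving it a second name instead of copying it."""
    os.link(src, dst)


def open_for_append(path):
    """Break the hard link first if another name shares this file."""
    if os.stat(path).st_nlink > 1:
        tmp_path = path + ".copy"
        shutil.copyfile(path, tmp_path)
        os.replace(tmp_path, path)  # path now names a private copy
    return open(path, "ab")


root = tempfile.mkdtemp()
original = os.path.join(root, "original.i")
clone = os.path.join(root, "clone.i")

with open(original, "wb") as f:
    f.write(b"rev0")

clone_file(original, clone)
links_before = os.stat(clone).st_nlink   # two names, one file

with open_for_append(clone) as f:        # first write triggers the copy
    f.write(b"rev1")

with open(original, "rb") as f:
    original_data = f.read()
with open(clone, "rb") as f:
    clone_data = f.read()
links_after = os.stat(original).st_nlink
```

<para>The copy is made only when a shared file is about to be written,
so files that never change after the clone continue to share
storage.</para>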
725 | |
726 </sect2> | |
727 <sect2> | |
728 <title>Other contents of the dirstate</title> | |
729 | |
730 <para id="x_335">Because Mercurial doesn't force you to tell it when you're | |
731 modifying a file, it uses the dirstate to store some extra | |
732 information so it can determine efficiently whether you have | |
733 modified a file. For each file in the working directory, it | |
734 stores the time at which Mercurial itself last modified the file, and the | |
735 size of the file at that time.</para> | |
736 | |
737 <para id="x_336">When you explicitly <command role="hg-cmd">hg | |
738 add</command>, <command role="hg-cmd">hg remove</command>, | |
739 <command role="hg-cmd">hg rename</command> or <command | |
740 role="hg-cmd">hg copy</command> files, Mercurial updates the | |
741 dirstate so that it knows what to do with those files when you | |
742 commit.</para> | |
743 | |
744 <para id="x_337">The dirstate helps Mercurial to efficiently | |
745 check the status of files in a repository.</para> | |
746 | |
747 <itemizedlist> | |
748 <listitem> | |
749 <para id="x_726">When Mercurial checks the state of a file in the | |
750 working directory, it first checks a file's modification | |
751 time against the time in the dirstate that records when | |
752 Mercurial last wrote the file. If the last modified time | |
753 is the same as the time when Mercurial wrote the file, the | |
754 file must not have been modified, so Mercurial does not | |
755 need to check any further.</para> | |
756 </listitem> | |
757 <listitem> | |
758 <para id="x_727">If the file's size has changed, the file must have | |
759 been modified. If the modification time has changed, but | |
760 the size has not, only then does Mercurial need to | |
761 actually read the contents of the file to see if it has | |
762 changed.</para> | |
763 </listitem> | |
764 </itemizedlist> | |
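
<para>The two checks in the list above can be sketched as a single
function. The names <literal>record</literal> and
<literal>is_modified</literal> are hypothetical, and the
<literal>recorded_contents</literal> argument is a stand-in for the
committed file data that Mercurial would compare against; this is an
illustration of the decision order, not the real dirstate code.</para>

```python
import os
import tempfile


def record(path):
    """Return the (mtime, size) pair a toy dirstate would store for path."""
    st = os.stat(path)
    return st.st_mtime_ns, st.st_size


def is_modified(path, recorded, recorded_contents):
    """Decide whether path changed, reading it only as a last resort."""
    recorded_mtime, recorded_size = recorded
    st = os.stat(path)
    if st.st_mtime_ns == recorded_mtime:
        return False   # same timestamp: treated as unmodified, no read
    if st.st_size != recorded_size:
        return True    # size differs: modified, still no read
    with open(path, "rb") as f:
        return f.read() != recorded_contents  # only now compare contents


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "hello.txt")
with open(path, "wb") as f:
    f.write(b"hello")
# Set explicit timestamps so the example does not depend on clock resolution.
os.utime(path, ns=(1_000_000_000, 1_000_000_000))
state = record(path)

untouched = is_modified(path, state, b"hello")

os.utime(path, ns=(2_000_000_000, 2_000_000_000))  # new time, same bytes
touched = is_modified(path, state, b"hello")

with open(path, "wb") as f:
    f.write(b"jello")                               # same size, new bytes
os.utime(path, ns=(3_000_000_000, 3_000_000_000))
same_size_edit = is_modified(path, state, b"hello")

with open(path, "wb") as f:
    f.write(b"hello, world")                        # different size
os.utime(path, ns=(4_000_000_000, 4_000_000_000))
grown = is_modified(path, state, b"hello")
```

<para>Only the second and third cases open the file at all; the first
and last are settled by a single <literal>stat</literal> call.</para>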
765 | |
766 <para id="x_728">Storing the modification time and size dramatically | |
767 reduces the number of read operations that Mercurial needs to | |
768 perform when you run commands like <command role="hg-cmd">hg status</command>. | |
769 This results in large performance improvements.</para> | |
770 </sect2> | |
771 </sect1> | |
772 </chapter> | |
773 | |
774 <!-- | |
775 local variables: | |
776 sgml-parent-document: ("00book.xml" "book" "chapter") | |
777 end: | |
778 --> |