Mercurial > hgbook
comparison en/ch03-concepts.xml @ 682:28b5a5befb08
Fold preface and intro into one
author | Bryan O'Sullivan <bos@serpentine.com> |
---|---|
date | Thu, 19 Mar 2009 20:54:12 -0700 |
parents | en/ch04-concepts.xml@13513d2a128d |
children | c838b3975bc6 |
681:5bfa0df6aaed | 682:28b5a5befb08 |
---|---|
1 <!-- vim: set filetype=docbkxml shiftwidth=2 autoindent expandtab tw=77 : --> | |
2 | |
3 <chapter id="chap:concepts"> | |
4 <?dbhtml filename="behind-the-scenes.html"?> | |
5 <title>Behind the scenes</title> | |
6 | |
7 <para>Unlike many revision control systems, the concepts upon which | |
8 Mercurial is built are simple enough that it's easy to understand | |
9 how the software really works. Knowing this certainly isn't | |
10 necessary, but I find it useful to have a <quote>mental | |
11 model</quote> of what's going on.</para> | |
12 | |
13 <para>This understanding gives me confidence that Mercurial has been | |
14 carefully designed to be both <emphasis>safe</emphasis> and | |
15 <emphasis>efficient</emphasis>. And just as importantly, if it's | |
16 easy for me to retain a good idea of what the software is doing | |
17 when I perform a revision control task, I'm less likely to be | |
18 surprised by its behaviour.</para> | |
19 | |
20 <para>In this chapter, we'll initially cover the core concepts | |
21 behind Mercurial's design, then continue to discuss some of the | |
22 interesting details of its implementation.</para> | |
23 | |
24 <sect1> | |
25 <title>Mercurial's historical record</title> | |
26 | |
27 <sect2> | |
28 <title>Tracking the history of a single file</title> | |
29 | |
30 <para>When Mercurial tracks modifications to a file, it stores | |
31 the history of that file in a metadata object called a | |
32 <emphasis>filelog</emphasis>. Each entry in the filelog | |
33 contains enough information to reconstruct one revision of the | |
34 file that is being tracked. Filelogs are stored as files in | |
35 the <filename role="special" | |
36 class="directory">.hg/store/data</filename> directory. A | |
37 filelog contains two kinds of information: revision data, and | |
38 an index to help Mercurial to find a revision | |
39 efficiently.</para> | |
40 | |
41 <para>A file that is large, or has a lot of history, has its | |
42 filelog stored in separate data | |
43 (<quote><literal>.d</literal></quote> suffix) and index | |
44 (<quote><literal>.i</literal></quote> suffix) files. For | |
45 small files without much history, the revision data and index | |
46 are combined in a single <quote><literal>.i</literal></quote> | |
47 file. The correspondence between a file in the working | |
48 directory and the filelog that tracks its history in the | |
49 repository is illustrated in figure <xref | |
50 linkend="fig:concepts:filelog"/>.</para> | |
51 | |
52 <informalfigure id="fig:concepts:filelog"> | |
53 <mediaobject><imageobject><imagedata | |
54 fileref="filelog"/></imageobject><textobject><phrase>XXX | |
55 add text</phrase></textobject> | |
56 <caption><para>Relationships between files in working | |
57 directory and filelogs in | |
58 repository</para></caption></mediaobject> | |
59 </informalfigure> | |
60 | |
61 </sect2> | |
62 <sect2> | |
63 <title>Managing tracked files</title> | |
64 | |
65 <para>Mercurial uses a structure called a | |
66 <emphasis>manifest</emphasis> to collect together information | |
67 about the files that it tracks. Each entry in the manifest | |
68 contains information about the files present in a single | |
69 changeset. An entry records which files are present in the | |
70 changeset, the revision of each file, and a few other pieces | |
71 of file metadata.</para> | |
72 | |
73 </sect2> | |
74 <sect2> | |
75 <title>Recording changeset information</title> | |
76 | |
77 <para>The <emphasis>changelog</emphasis> contains information | |
78 about each changeset. Each revision records who committed a | |
79 change, the changeset comment, other pieces of | |
80 changeset-related information, and the revision of the | |
81 manifest to use.</para> | |
82 | |
83 </sect2> | |
84 <sect2> | |
85 <title>Relationships between revisions</title> | |
86 | |
87 <para>Within a changelog, a manifest, or a filelog, each | |
88 revision stores a pointer to its immediate parent (or to its | |
89 two parents, if it's a merge revision). As I mentioned above, | |
90 there are also relationships between revisions | |
91 <emphasis>across</emphasis> these structures, and they are | |
92 hierarchical in nature.</para> | |
93 | |
94 <para>For every changeset in a repository, there is exactly one | |
95 revision stored in the changelog. Each revision of the | |
96 changelog contains a pointer to a single revision of the | |
97 manifest. A revision of the manifest stores a pointer to a | |
98 single revision of each filelog tracked when that changeset | |
99 was created. These relationships are illustrated in figure | |
100 <xref linkend="fig:concepts:metadata"/>.</para> | |
101 | |
102 <informalfigure id="fig:concepts:metadata"> | |
103 <mediaobject><imageobject><imagedata | |
104 fileref="metadata"/></imageobject><textobject><phrase>XXX | |
105 add text</phrase></textobject><caption><para>Metadata | |
106 relationships</para></caption> | |
107 </mediaobject> | |
108 </informalfigure> | |
109 | |
110 <para>As the illustration shows, there is | |
111 <emphasis>not</emphasis> a <quote>one to one</quote> | |
112 relationship between revisions in the changelog, manifest, or | |
113 filelog. If the manifest hasn't changed between two | |
114 changesets, the changelog entries for those changesets will | |
115 point to the same revision of the manifest. If a file that | |
116 Mercurial tracks hasn't changed between two changesets, the | |
117 entry for that file in the two revisions of the manifest will | |
118 point to the same revision of its filelog.</para> | |
119 | |
120 </sect2> | |
121 </sect1> | |
122 <sect1> | |
123 <title>Safe, efficient storage</title> | |
124 | |
125 <para>The underpinnings of changelogs, manifests, and filelogs are | |
126 provided by a single structure called the | |
127 <emphasis>revlog</emphasis>.</para> | |
128 | |
129 <sect2> | |
130 <title>Efficient storage</title> | |
131 | |
132 <para>The revlog provides efficient storage of revisions using a | |
133 <emphasis>delta</emphasis> mechanism. Instead of storing a | |
134 complete copy of a file for each revision, it stores the | |
135 changes needed to transform an older revision into the new | |
136 revision. For many kinds of file data, these deltas are | |
137 typically a fraction of a percent of the size of a full copy | |
138 of a file.</para> | |
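The delta idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses `difflib` from the standard library rather than Mercurial's own binary diff code, and the names `make_delta` and `apply_delta` are hypothetical.

```python
from difflib import SequenceMatcher


def make_delta(old: bytes, new: bytes) -> list:
    """Describe `new` as a list of edits against `old`.

    Each edit is (start, end, replacement): replace old[start:end]
    with `replacement`. Unchanged regions are not stored at all.
    """
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            delta.append((i1, i2, new[j1:j2]))
    return delta


def apply_delta(old: bytes, delta: list) -> bytes:
    """Rebuild the new revision from the old one plus a delta."""
    out = []
    pos = 0
    for start, end, replacement in delta:
        out.append(old[pos:start])   # copy the untouched bytes
        out.append(replacement)      # splice in the changed bytes
        pos = end
    out.append(old[pos:])
    return b"".join(out)


# A file of 150 lines in which a single line changes.
old = b"line one\nline two\nline three\n" * 50
new = old.replace(b"line two\n", b"line 2\n", 1)
delta = make_delta(old, new)
stored = sum(len(replacement) for _, _, replacement in delta)
```

Here `stored` counts only the replacement bytes held in the delta, which is tiny next to `len(new)`. Because the sketch works on bytes, it makes no distinction between text and binary content.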
139 | |
140 <para>Some obsolete revision control systems can only work with | |
141 deltas of text files. They must either store binary files as | |
142 complete snapshots or encoded into a text representation, both | |
143 of which are wasteful approaches. Mercurial can efficiently | |
144 handle deltas of files with arbitrary binary contents; it | |
145 doesn't need to treat text as special.</para> | |
146 | |
147 </sect2> | |
148 <sect2 id="sec:concepts:txn"> | |
149 <title>Safe operation</title> | |
150 | |
151 <para>Mercurial only ever <emphasis>appends</emphasis> data to | |
152 the end of a revlog file. It never modifies a section of a | |
153 file after it has written it. This is more robust and more | |
154 efficient than schemes that need to modify or rewrite | |
155 data.</para> | |
156 | |
157 <para>In addition, Mercurial treats every write as part of a | |
158 <emphasis>transaction</emphasis> that can span a number of | |
159 files. A transaction is <emphasis>atomic</emphasis>: either | |
160 the entire transaction succeeds and its effects are all | |
161 visible to readers in one go, or the whole thing is undone. | |
162 This guarantee of atomicity means that if you're running two | |
163 copies of Mercurial, where one is reading data and one is | |
164 writing it, the reader will never see a partially written | |
165 result that might confuse it.</para> | |
166 | |
167 <para>The fact that Mercurial only appends to files makes it | |
168 easier to provide this transactional guarantee. The easier it | |
169 is to do stuff like this, the more confident you should be | |
170 that it's done correctly.</para> | |
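To see why append-only files make a transaction easy to undo, here is a rough Python sketch. It assumes that a transaction only needs to remember how long each file was before its first append. The `Transaction` class and its in-memory journal are illustrative names, not Mercurial's API; a real implementation would keep the journal on disk so that it survives a crash.

```python
import os
import tempfile


class Transaction:
    """Append to several files; on failure, truncate them all back."""

    def __init__(self):
        self.journal = {}  # path -> file length before this transaction

    def append(self, path, data):
        if path not in self.journal:
            size = os.path.getsize(path) if os.path.exists(path) else 0
            self.journal[path] = size
        with open(path, "ab") as f:
            f.write(data)

    def abort(self):
        # Undoing an append-only write is just a truncation.
        for path, size in self.journal.items():
            with open(path, "r+b") as f:
                f.truncate(size)
        self.journal.clear()

    def commit(self):
        # Nothing to undo any more; forget the recorded lengths.
        self.journal.clear()


workdir = tempfile.mkdtemp()
file_a = os.path.join(workdir, "a.i")
file_b = os.path.join(workdir, "b.i")
with open(file_a, "wb") as f:
    f.write(b"old-a")

txn = Transaction()
txn.append(file_a, b"new-a")
txn.append(file_b, b"new-b")
txn.abort()  # both files return to their earlier lengths
```

After the abort, `a.i` holds only its original bytes and `b.i` is empty. Data that was already in the files is never rewritten, so there is nothing else to restore.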
171 | |
172 </sect2> | |
173 <sect2> | |
174 <title>Fast retrieval</title> | |
175 | |
176 <para>Mercurial cleverly avoids a pitfall common to many earlier | |
177 revision control systems: the problem of <emphasis>inefficient | |
178 retrieval</emphasis>. Most revision control systems store | |
179 the contents of a revision as an incremental series of | |
180 modifications against a <quote>snapshot</quote>. To | |
181 reconstruct a specific revision, you must first read the | |
182 snapshot, and then every one of the revisions between the | |
183 snapshot and your target revision. The more history that a | |
184 file accumulates, the more revisions you must read, hence the | |
185 longer it takes to reconstruct a particular revision.</para> | |
186 | |
187 <informalfigure id="fig:concepts:snapshot"> | |
188 <mediaobject><imageobject><imagedata | |
189 fileref="snapshot"/></imageobject><textobject><phrase>XXX | |
190 add text</phrase></textobject><caption><para>Snapshot of | |
191 a revlog, with incremental | |
192 deltas</para></caption></mediaobject> | |
193 </informalfigure> | |
194 | |
195 <para>The innovation that Mercurial applies to this problem is | |
196 simple but effective. Once the cumulative amount of delta | |
197 information stored since the last snapshot exceeds a fixed | |
198 threshold, it stores a new snapshot (compressed, of course), | |
199 instead of another delta. This makes it possible to | |
200 reconstruct <emphasis>any</emphasis> revision of a file | |
201 quickly. This approach works so well that it has since been | |
202 copied by several other revision control systems.</para> | |
203 | |
204 <para>Figure <xref linkend="fig:concepts:snapshot"/> illustrates | |
205 the idea. In an entry in a revlog's index file, Mercurial | |
206 stores the range of entries from the data file that it must | |
207 read to reconstruct a particular revision.</para> | |
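The snapshot rule can be sketched as a toy revlog in Python. Everything here is an assumption made for illustration: the `ToyRevlog` class, the crude prefix/suffix delta, and the threshold of 40 bytes are not taken from Mercurial. Each entry records the index of the snapshot that its delta chain starts from, which plays the role of the range stored in the index.

```python
def make_delta(old: bytes, new: bytes):
    """Crude delta: keep the common prefix and suffix, store the middle."""
    limit = min(len(old), len(new))
    p = 0
    while p < limit and old[p] == new[p]:
        p += 1
    s = 0
    while s < limit - p and old[-1 - s] == new[-1 - s]:
        s += 1
    return (p, len(old) - s, new[p:len(new) - s])


def apply_delta(old: bytes, delta) -> bytes:
    start, end, middle = delta
    return old[:start] + middle + old[end:]


class ToyRevlog:
    def __init__(self, threshold):
        self.threshold = threshold
        self.entries = []    # (base, payload); base == own index: snapshot
        self.chain_size = 0  # delta bytes stored since the last snapshot
        self.tip_text = b""

    def add(self, text: bytes) -> int:
        rev = len(self.entries)
        delta = make_delta(self.tip_text, text)
        cost = self.chain_size + len(delta[2])
        if rev == 0 or cost > self.threshold:
            self.entries.append((rev, text))        # store a new snapshot
            self.chain_size = 0
        else:
            base = self.entries[rev - 1][0]
            self.entries.append((base, delta))      # store another delta
            self.chain_size = cost
        self.tip_text = text
        return rev

    def read(self, rev: int) -> bytes:
        base = self.entries[rev][0]
        text = self.entries[base][1]                # start from the snapshot
        for i in range(base + 1, rev + 1):          # then replay the deltas
            text = apply_delta(text, self.entries[i][1])
        return text


log = ToyRevlog(threshold=40)
texts = []
text = b""
for i in range(30):
    text += b"entry %d\n" % i
    texts.append(text)
    log.add(text)

longest_chain = max(rev - base for rev, (base, _) in enumerate(log.entries))
```

However much history accumulates, `read` never replays more than `longest_chain` deltas, because the chain restarts whenever the stored delta bytes pass the threshold.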
208 | |
209 <sect3> | |
210 <title>Aside: the influence of video compression</title> | |
211 | |
212 <para>If you're familiar with video compression or have ever | |
213 watched a TV feed through a digital cable or satellite | |
214 service, you may know that most video compression schemes | |
215 store each frame of video as a delta against its predecessor | |
216 frame. In addition, these schemes use <quote>lossy</quote> | |
217 compression techniques to increase the compression ratio, so | |
218 visual errors accumulate over the course of a number of | |
219 inter-frame deltas.</para> | |
220 | |
221 <para>Because it's possible for a video stream to <quote>drop | |
222 out</quote> occasionally due to signal glitches, and to | |
223 limit the accumulation of artefacts introduced by the lossy | |
224 compression process, video encoders periodically insert a | |
225 complete frame (called a <quote>key frame</quote>) into the | |
226 video stream; the next delta is generated against that | |
227 frame. This means that if the video signal gets | |
228 interrupted, it will resume once the next key frame is | |
229 received. Also, the accumulation of encoding errors | |
230 restarts anew with each key frame.</para> | |
231 | |
232 </sect3> | |
233 </sect2> | |
234 <sect2> | |
235 <title>Identification and strong integrity</title> | |
236 | |
237 <para>Along with delta or snapshot information, a revlog entry | |
238 contains a cryptographic hash of the data that it represents. | |
239 This makes it difficult to forge the contents of a revision, | |
240 and easy to detect accidental corruption.</para> | |
241 | |
242 <para>Hashes provide more than a mere check against corruption; | |
243 they are used as the identifiers for revisions. The changeset | |
244 identification hashes that you see as an end user are from | |
245 revisions of the changelog. Although filelogs and the | |
246 manifest also use hashes, Mercurial only uses these behind the | |
247 scenes.</para> | |
248 | |
249 <para>Mercurial verifies that hashes are correct when it | |
250 retrieves file revisions and when it pulls changes from | |
251 another repository. If it encounters an integrity problem, it | |
252 will complain and stop whatever it's doing.</para> | |
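As a hedged sketch of how a revision's identifier can double as an integrity check: the code below assumes the hash is SHA-1 taken over the two parent IDs (sorted) followed by the revision text, which is my understanding of how a revlog computes it. The function names `node_id` and `verify` are hypothetical.

```python
import hashlib

NULL_ID = b"\0" * 20  # the "no parent" marker: twenty zero bytes


def node_id(text: bytes, p1: bytes = NULL_ID, p2: bytes = NULL_ID) -> bytes:
    """Identify a revision by hashing its parents and its contents."""
    first, second = sorted((p1, p2))
    h = hashlib.sha1(first)
    h.update(second)
    h.update(text)
    return h.digest()


def verify(text: bytes, p1: bytes, p2: bytes, expected: bytes) -> None:
    """Raise if the stored data no longer matches its identifier."""
    if node_id(text, p1, p2) != expected:
        raise ValueError("integrity check failed")


root = node_id(b"hello\n")
child = node_id(b"hello, world\n", root)

verify(b"hello, world\n", root, NULL_ID, child)      # passes silently
try:
    verify(b"hello, w0rld\n", root, NULL_ID, child)  # one byte changed
    tampering_detected = False
except ValueError:
    tampering_detected = True
```

Because the parent IDs feed into the hash, an identifier depends on the whole history leading up to a revision, not only on that revision's contents.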
253 | |
254 <para>In addition to the effect it has on retrieval efficiency, | |
255 Mercurial's use of periodic snapshots makes it more robust | |
256 against partial data corruption. If a revlog becomes partly | |
257 corrupted due to a hardware error or system bug, it's often | |
258 possible to reconstruct some or most revisions from the | |
259 uncorrupted sections of the revlog, both before and after the | |
260 corrupted section. This would not be possible with a | |
261 delta-only storage model.</para> | |
262 | |
263 </sect2> | |
264 </sect1> | |
265 <sect1> | |
266 <title>Revision history, branching, and merging</title> | |
267 | |
268 <para>Every entry in a Mercurial revlog knows the identity of its | |
269 immediate ancestor revision, usually referred to as its | |
270 <emphasis>parent</emphasis>. In fact, a revision contains room | |
271 for not one parent, but two. Mercurial uses a special hash, | |
272 called the <quote>null ID</quote>, to represent the idea | |
273 <quote>there is no parent here</quote>. This hash is simply a | |
274 string of zeroes.</para> | |
275 | |
276 <para>In figure <xref linkend="fig:concepts:revlog"/>, you can see | |
277 an example of the conceptual structure of a revlog. Filelogs, | |
278 manifests, and changelogs all have this same structure; they | |
279 differ only in the kind of data stored in each delta or | |
280 snapshot.</para> | |
281 | |
282 <para>The first revision in a revlog (at the bottom of the image) | |
283 has the null ID in both of its parent slots. For a | |
284 <quote>normal</quote> revision, its first parent slot contains | |
285 the ID of its parent revision, and its second contains the null | |
286 ID, indicating that the revision has only one real parent. Any | |
287 two revisions that have the same parent ID are branches. A | |
288 revision that represents a merge between branches has two normal | |
289 revision IDs in its parent slots.</para> | |
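The parent-slot rules above can be written down as a short Python sketch. The revision IDs are shortened, made-up names (real IDs are hashes), and the helpers `kind` and `branch_points` are hypothetical.

```python
NULL_ID = "0" * 40  # the "no parent here" marker: a string of zeroes


def kind(parents):
    """Classify a revision by looking at its two parent slots."""
    p1, p2 = parents
    if p1 == NULL_ID and p2 == NULL_ID:
        return "root"
    if p2 == NULL_ID:
        return "normal"
    return "merge"


def branch_points(revlog):
    """Return the IDs that appear as a parent of more than one revision."""
    counts = {}
    for p1, p2 in revlog.values():
        for parent in (p1, p2):
            if parent != NULL_ID:
                counts[parent] = counts.get(parent, 0) + 1
    return sorted(p for p, n in counts.items() if n > 1)


# Maps each revision ID to the contents of its two parent slots.
revlog = {
    "r0": (NULL_ID, NULL_ID),
    "r1": ("r0", NULL_ID),
    "r2": ("r1", NULL_ID),
    "r3": ("r1", NULL_ID),   # same parent as r2, so r2 and r3 are branches
    "r4": ("r2", "r3"),      # a merge of the two branches
}
kinds = {rev: kind(parents) for rev, parents in revlog.items()}
```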
290 | |
291 <informalfigure id="fig:concepts:revlog"> | |
292 <mediaobject><imageobject><imagedata | |
293 fileref="revlog"/></imageobject><textobject><phrase>XXX | |
294 add text</phrase></textobject></mediaobject> | |
295 </informalfigure> | |
296 | |
297 </sect1> | |
298 <sect1> | |
299 <title>The working directory</title> | |
300 | |
301 <para>In the working directory, Mercurial stores a snapshot of the | |
302 files from the repository as of a particular changeset.</para> | |
303 | |
304 <para>The working directory <quote>knows</quote> which changeset | |
305 it contains. When you update the working directory to contain a | |
306 particular changeset, Mercurial looks up the appropriate | |
307 revision of the manifest to find out which files it was tracking | |
308 at the time that changeset was committed, and which revision of | |
309 each file was then current. It then recreates a copy of each of | |
310 those files, with the same contents it had when the changeset | |
311 was committed.</para> | |
312 | |
313 <para>The <emphasis>dirstate</emphasis> contains Mercurial's | |
314 knowledge of the working directory. This details which | |
315 changeset the working directory is updated to, and all of the | |
316 files that Mercurial is tracking in the working | |
317 directory.</para> | |
318 | |
319 <para>Just as a revision of a revlog has room for two parents, so | |
320 that it can represent either a normal revision (with one parent) | |
321 or a merge of two earlier revisions, the dirstate has slots for | |
322 two parents. When you use the <command role="hg-cmd">hg | |
323 update</command> command, the changeset that you update to is | |
324 stored in the <quote>first parent</quote> slot, and the null ID | |
325 in the second. When you <command role="hg-cmd">hg | |
326 merge</command> with another changeset, the first parent | |
327 remains unchanged, and the second parent is filled in with the | |
328 changeset you're merging with. The <command role="hg-cmd">hg | |
329 parents</command> command tells you what the parents of the | |
330 dirstate are.</para> | |
331 | |
332 <sect2> | |
333 <title>What happens when you commit</title> | |
334 | |
335 <para>The dirstate stores parent information for more than just | |
336 book-keeping purposes. Mercurial uses the parents of the | |
337 dirstate as <emphasis>the parents of a new | |
338 changeset</emphasis> when you perform a commit.</para> | |
339 | |
340 <informalfigure id="fig:concepts:wdir"> | |
341 <mediaobject><imageobject><imagedata | |
342 fileref="wdir"/></imageobject><textobject><phrase>XXX | |
343 add text</phrase></textobject><caption><para>The working | |
344 directory can have two | |
345 parents</para></caption></mediaobject> | |
346 </informalfigure> | |
347 | |
348 <para>Figure <xref linkend="fig:concepts:wdir"/> shows the | |
349 normal state of the working directory, where it has a single | |
350 changeset as parent. That changeset is the | |
351 <emphasis>tip</emphasis>, the newest changeset in the | |
352 repository that has no children.</para> | |
353 | |
354 <informalfigure id="fig:concepts:wdir-after-commit"> | |
355 <mediaobject><imageobject><imagedata | |
356 fileref="wdir-after-commit"/></imageobject><textobject><phrase>XXX | |
357 add text</phrase></textobject><caption><para>The working | |
358 directory gains new parents after a | |
359 commit</para></caption></mediaobject> | |
360 </informalfigure> | |
361 | |
362 <para>It's useful to think of the working directory as | |
363 <quote>the changeset I'm about to commit</quote>. Any files | |
364 that you tell Mercurial that you've added, removed, renamed, | |
365 or copied will be reflected in that changeset, as will | |
366 modifications to any files that Mercurial is already tracking; | |
367 the new changeset will have the parents of the working | |
368 directory as its parents.</para> | |
369 | |
370 <para>After a commit, Mercurial will update the parents of the | |
371 working directory, so that the first parent is the ID of the | |
372 new changeset, and the second is the null ID. This is shown | |
373 in figure <xref linkend="fig:concepts:wdir-after-commit"/>. | |
374 Mercurial | |
375 doesn't touch any of the files in the working directory when | |
376 you commit; it just modifies the dirstate to note its new | |
377 parents.</para> | |
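The way the two parent slots change during update, merge, and commit can be modelled in a few lines of Python. This is a sketch of the bookkeeping only: the `WorkingDirectory` class, the plain dictionary standing in for a repository, and the changeset names are all hypothetical.

```python
NULL = "null"  # stands in for the null ID


class WorkingDirectory:
    def __init__(self):
        self.parents = (NULL, NULL)

    def update(self, changeset):
        # The changeset goes in the first slot, the null ID in the second.
        self.parents = (changeset, NULL)

    def merge(self, other):
        if self.parents[1] != NULL:
            raise RuntimeError("commit the outstanding merge first")
        # The first parent is unchanged; the second slot is filled in.
        self.parents = (self.parents[0], other)

    def commit(self, repo, new_id):
        # The dirstate's parents become the new changeset's parents...
        repo[new_id] = self.parents
        # ...and the working directory is now based on the new changeset.
        self.parents = (new_id, NULL)


repo = {"a": (NULL, NULL), "c": ("a", NULL)}
wd = WorkingDirectory()
wd.update("a")
wd.commit(repo, "b")      # an ordinary commit on top of "a"
wd.merge("c")
wd.commit(repo, "d")      # a merge commit with two parents
```

Note that `commit` touches only the recorded parents, never the files themselves.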
378 | |
379 </sect2> | |
380 <sect2> | |
381 <title>Creating a new head</title> | |
382 | |
383 <para>It's perfectly normal to update the working directory to a | |
384 changeset other than the current tip. For example, you might | |
385 want to know what your project looked like last Tuesday, or | |
386 you could be looking through changesets to see which one | |
387 introduced a bug. In cases like this, the natural thing to do | |
388 is update the working directory to the changeset you're | |
389 interested in, and then examine the files in the working | |
390 directory directly to see their contents as they were when you | |
391 committed that changeset. The effect of this is shown in | |
392 figure <xref linkend="fig:concepts:wdir-pre-branch"/>.</para> | |
393 | |
394 <informalfigure id="fig:concepts:wdir-pre-branch"> | |
395 <mediaobject><imageobject><imagedata | |
396 fileref="wdir-pre-branch"/></imageobject><textobject><phrase>XXX | |
397 add text</phrase></textobject><caption><para>The working | |
398 directory, updated to an older | |
399 changeset</para></caption></mediaobject> | |
400 </informalfigure> | |
401 | |
402 <para>Having updated the working directory to an older | |
403 changeset, what happens if you make some changes, and then | |
404 commit? Mercurial behaves in the same way as I outlined | |
405 above. The parents of the working directory become the | |
406 parents of the new changeset. This new changeset has no | |
407 children, so it becomes the new tip. And the repository now | |
408 contains two changesets that have no children; we call these | |
409 <emphasis>heads</emphasis>. You can see the structure that | |
410 this creates in figure <xref | |
411 linkend="fig:concepts:wdir-branch"/>.</para> | |
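Finding the heads of a repository amounts to listing the changesets that nothing else names as a parent. A minimal Python sketch, using made-up changeset names and a hypothetical `heads` helper:

```python
NULL = "null"  # stands in for the null ID


def heads(repo):
    """Return the changesets that no other changeset names as a parent.

    `repo` maps each changeset ID to its (first parent, second parent).
    """
    have_children = set()
    for p1, p2 in repo.values():
        have_children.update((p1, p2))
    return [cs for cs in repo if cs not in have_children]


# A linear history a -> b -> c, with the working directory at the tip "c".
repo = {
    "a": (NULL, NULL),
    "b": ("a", NULL),
    "c": ("b", NULL),
}
before = heads(repo)

# Update to the older changeset "b", make a change, and commit it as "d".
repo["d"] = ("b", NULL)
after = heads(repo)
```

Before the extra commit the only head is the tip; afterwards both `c` and `d` are childless, so the repository has two heads.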
412 | |
413 <informalfigure id="fig:concepts:wdir-branch"> | |
414 <mediaobject><imageobject><imagedata | |
415 fileref="wdir-branch"/></imageobject><textobject><phrase>XXX | |
416 add text</phrase></textobject><caption><para>After a | |
417 commit made while synced to an older | |
418 changeset</para></caption></mediaobject> | |
419 </informalfigure> | |
420 | |
421 <note> | |
422 <para> If you're new to Mercurial, you should keep in mind a | |
423 common <quote>error</quote>, which is to use the <command | |
424 role="hg-cmd">hg pull</command> command without any | |
425 options. By default, the <command role="hg-cmd">hg | |
426 pull</command> command <emphasis>does not</emphasis> | |
427 update the working directory, so you'll bring new changesets | |
428 into your repository, but the working directory will stay | |
429 synced at the same changeset as before the pull. If you | |
430 make some changes and commit afterwards, you'll thus create | |
431 a new head, because your working directory isn't synced to | |
432 whatever the current tip is.</para> | |
433 | |
434 <para> I put the word <quote>error</quote> in quotes because | |
435 all that you need to do to rectify this situation is | |
436 <command role="hg-cmd">hg merge</command>, then <command | |
437 role="hg-cmd">hg commit</command>. In other words, this | |
438 almost never has negative consequences; it just surprises | |
439 people. I'll discuss other ways to avoid this behaviour, | |
440 and why Mercurial behaves in this initially surprising way, | |
441 later on.</para> | |
442 </note> | |
443 | |
444 </sect2> | |
445 <sect2> | |
446 <title>Merging heads</title> | |
447 | |
448 <para>When you run the <command role="hg-cmd">hg merge</command> | |
449 command, Mercurial leaves the first parent of the working | |
450 directory unchanged, and sets the second parent to the | |
451 changeset you're merging with, as shown in figure <xref | |
452 linkend="fig:concepts:wdir-merge"/>.</para> | |
453 | |
454 <informalfigure id="fig:concepts:wdir-merge"> | |
455 <mediaobject><imageobject><imagedata | |
456 fileref="wdir-merge"/></imageobject><textobject><phrase>XXX | |
457 add text</phrase></textobject><caption><para>Merging two | |
458 heads</para></caption></mediaobject> | |
459 </informalfigure> | |
460 | |
461 <para>Mercurial also has to modify the working directory, to | |
462 merge the files managed in the two changesets. Simplified a | |
463 little, the merging process goes like this, for every file in | |
464 the manifests of both changesets:</para> | |
465 <itemizedlist> | |
466 <listitem><para>If neither changeset has modified a file, do | |
467 nothing with that file.</para> | |
468 </listitem> | |
469 <listitem><para>If one changeset has modified a file, and the | |
470 other hasn't, create the modified copy of the file in the | |
471 working directory.</para> | |
472 </listitem> | |
473 <listitem><para>If one changeset has removed a file, and the | |
474 other hasn't (or has also deleted it), delete the file | |
475 from the working directory.</para> | |
476 </listitem> | |
477 <listitem><para>If one changeset has removed a file, but the | |
478 other has modified the file, ask the user what to do: keep | |
479 the modified file, or remove it?</para> | |
480 </listitem> | |
481 <listitem><para>If both changesets have modified a file, | |
482 invoke an external merge program to choose the new | |
483 contents for the merged file. This may require input from | |
484 the user.</para> | |
485 </listitem> | |
486 <listitem><para>If one changeset has modified a file, and the | |
487 other has renamed or copied the file, make sure that the | |
488 changes follow the new name of the file.</para> | |
489 </listitem></itemizedlist> | |
490 <para>There are more details&emdash;merging has plenty of corner | |
491 cases&emdash;but these are the most common choices that are | |
492 involved in a merge. As you can see, most cases are | |
493 completely automatic, and indeed most merges finish | |
494 automatically, without requiring your input to resolve any | |
495 conflicts.</para> | |
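The simplified rules in the list above can be sketched as a single Python function. It assumes that "modified" and "removed" are judged against the contents of the file in the common ancestor of the two changesets, and it leaves out the rename and copy case. The function name and the action strings are hypothetical.

```python
def merge_action(base, local, other):
    """Decide what to do with one file during a merge.

    Each argument is the file's contents in the common ancestor, in the
    first parent, and in the changeset being merged; None means the file
    is absent there.
    """
    local_changed = local != base
    other_changed = other != base

    if not local_changed and not other_changed:
        return "do nothing"
    if local_changed and not other_changed:
        return "delete" if local is None else "use local"
    if other_changed and not local_changed:
        return "delete" if other is None else "use other"

    # From here on, both sides differ from the common ancestor.
    if local is None and other is None:
        return "delete"
    if local is None or other is None:
        return "ask user"        # removed on one side, modified on the other
    return "run merge tool"      # modified on both sides


cases = {
    "untouched":        merge_action("v1", "v1", "v1"),
    "changed in other": merge_action("v1", "v1", "v2"),
    "removed locally":  merge_action("v1", None, "v1"),
    "removed vs edit":  merge_action("v1", None, "v2"),
    "edited in both":   merge_action("v1", "v2", "v3"),
}
```

Only the last two cases can need input from the user, which matches the observation that most merges finish automatically.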
496 | |
497 <para>When you're thinking about what happens when you commit | |
498 after a merge, once again the working directory is <quote>the | |
499 changeset I'm about to commit</quote>. After the <command | |
500 role="hg-cmd">hg merge</command> command completes, the | |
501 working directory has two parents; these will become the | |
502 parents of the new changeset.</para> | |
503 | |
504 <para>Mercurial lets you perform multiple merges, but you must | |
505 commit the results of each individual merge as you go. This | |
506 is necessary because Mercurial tracks at most two parents, | |
507 both for revisions and for the working directory. While it would be | |
508 technically possible to merge multiple changesets at once, the | |
509 prospect of user confusion and making a terrible mess of a | |
510 merge immediately becomes overwhelming.</para> | |
511 | |
512 </sect2> | |
513 </sect1> | |
514 <sect1> | |
515 <title>Other interesting design features</title> | |
516 | |
517 <para>In the sections above, I've tried to highlight some of the | |
518 most important aspects of Mercurial's design, to illustrate that | |
519 it pays careful attention to reliability and performance. | |
520 However, the attention to detail doesn't stop there. There are | |
521 a number of other aspects of Mercurial's construction that I | |
522 personally find interesting. I'll detail a few of them here, | |
523 separate from the <quote>big ticket</quote> items above, so that | |
524 if you're interested, you can gain a better idea of the amount | |
525 of thinking that goes into a well-designed system.</para> | |
526 | |
527 <sect2> | |
528 <title>Clever compression</title> | |
529 | |
530 <para>When appropriate, Mercurial will store both snapshots and | |
531 deltas in compressed form. It does this by always | |
532 <emphasis>trying to</emphasis> compress a snapshot or delta, | |
533 but only storing the compressed version if it's smaller than | |
534 the uncompressed version.</para> | |
535 | |
536 <para>This means that Mercurial does <quote>the right | |
537 thing</quote> when storing a file whose native form is | |
538 compressed, such as a <literal>zip</literal> archive or a JPEG | |
539 image. When these types of files are compressed a second | |
540 time, the resulting file is usually bigger than the | |
541 once-compressed form, and so Mercurial will store the plain | |
542 <literal>zip</literal> or JPEG.</para> | |
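A sketch of "try to compress, keep the result only if it is smaller", using Python's `zlib` module (an implementation of deflate). The one-letter tags `"z"` and `"u"` and the function names are assumptions for illustration, not a description of the on-disk format.

```python
import os
import zlib


def pack(chunk: bytes):
    """Compress a snapshot or delta, but only keep the result if it helps."""
    compressed = zlib.compress(chunk)
    if len(compressed) < len(chunk):
        return ("z", compressed)   # store the compressed form
    return ("u", chunk)            # store the data as it is


def unpack(tag: str, payload: bytes) -> bytes:
    return zlib.decompress(payload) if tag == "z" else payload


text = b"the quick brown fox jumps over the lazy dog\n" * 100
noise = os.urandom(1000)  # stands in for already-compressed data

text_tag, text_payload = pack(text)
noise_tag, noise_payload = pack(noise)
```

Repetitive text comes back tagged `"z"`. Random bytes, which behave much like the contents of a `zip` archive or JPEG image, do not shrink and are stored untouched.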
543 | |
544 <para>Deltas between revisions of a compressed file are usually | |
545 larger than snapshots of the file, and Mercurial again does | |
546 <quote>the right thing</quote> in these cases. It finds that | |
547 such a delta exceeds the threshold at which it should store a | |
548 complete snapshot of the file, so it stores the snapshot, | |
549 again saving space compared to a naive delta-only | |
550 approach.</para> | |
551 | |
552 <sect3> | |
553 <title>Network recompression</title> | |
554 | |
555 <para>When storing revisions on disk, Mercurial uses the | |
556 <quote>deflate</quote> compression algorithm (the same one | |
557 used by the popular <literal>zip</literal> archive format), | |
558 which balances good speed with a respectable compression | |
559 ratio. However, when transmitting revision data over a | |
560 network connection, Mercurial uncompresses the compressed | |
561 revision data.</para> | |
562 | |
563 <para>If the connection is over HTTP, Mercurial recompresses | |
564 the entire stream of data using a compression algorithm that | |
565 gives a better compression ratio (the Burrows-Wheeler | |
566 algorithm from the widely used <literal>bzip2</literal> | |
567 compression package). This combination of algorithm and | |
568 compression of the entire stream (instead of a revision at a | |
569 time) substantially reduces the number of bytes to be | |
570 transferred, yielding better network performance over almost | |
571 all kinds of network.</para> | |
572 | |
573 <para>(If the connection is over <command>ssh</command>, | |
574 Mercurial <emphasis>doesn't</emphasis> recompress the | |
575 stream, because <command>ssh</command> can already do this | |
576 itself.)</para> | |
577 | |
578 </sect3> | |
579 </sect2> | |
580 <sect2> | |
581 <title>Read/write ordering and atomicity</title> | |
582 | |
583 <para>Appending to files isn't the whole story when it comes to | |
584 guaranteeing that a reader won't see a partial write. If you | |
585 recall figure <xref linkend="fig:concepts:metadata"/>, | |
586 revisions in the | |
587 changelog point to revisions in the manifest, and revisions in | |
588 the manifest point to revisions in filelogs. This hierarchy | |
589 is deliberate.</para> | |
590 | |
591 <para>A writer starts a transaction by writing filelog and | |
592 manifest data, and doesn't write any changelog data until | |
593 those are finished. A reader starts by reading changelog | |
594 data, then manifest data, followed by filelog data.</para> | |
595 | |
596 <para>Since the writer has always finished writing filelog and | |
597 manifest data before it writes to the changelog, a reader will | |
598 never read a pointer to a partially written manifest revision | |
599 from the changelog, and it will never read a pointer to a | |
600 partially written filelog revision from the manifest.</para> | |
601 | |
602 </sect2> | |
603 <sect2> | |
604 <title>Concurrent access</title> | |
605 | |
606 <para>The read/write ordering and atomicity guarantees mean that | |
607 Mercurial never needs to <emphasis>lock</emphasis> a | |
608 repository when it's reading data, even if the repository is | |
609 being written to while the read is occurring. This has a big | |
610 effect on scalability; you can have an arbitrary number of | |
611 Mercurial processes safely reading data from a repository | |
612 all at once, no matter whether it's being written to or | |
613 not.</para> | |
614 | |
615 <para>The lockless nature of reading means that if you're | |
616 sharing a repository on a multi-user system, you don't need to | |
617 grant other local users permission to | |
618 <emphasis>write</emphasis> to your repository in order for | |
619 them to be able to clone it or pull changes from it; they only | |
620 need <emphasis>read</emphasis> permission. (This is | |
621 <emphasis>not</emphasis> a common feature among revision | |
622 control systems, so don't take it for granted! Most require | |
623 readers to be able to lock a repository to access it safely, | |
624 and this requires write permission on at least one directory, | |
625 which of course makes for all kinds of nasty and annoying | |
626 security and administrative problems.)</para> | |
627 | |
628 <para>Mercurial uses locks to ensure that only one process can | |
629 write to a repository at a time (the locking mechanism is safe | |
630 even over filesystems that are notoriously hostile to locking, | |
631 such as NFS). If a repository is locked, a writer will wait | |
632 for a while and retry once the repository becomes unlocked, but | |
633 if the repository remains locked for too long, the process | |
634 attempting to write will time out. This means | |
635 that your daily automated scripts won't get stuck forever and | |
636 pile up if a system crashes unnoticed, for example. (Yes, the | |
637 timeout is configurable, from zero to infinity.)</para> | |
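<para>The wait-then-give-up behaviour described above can be sketched
as follows. This is only an illustration under stated assumptions: the
lock is modelled as an exclusively created file, and the names
acquire_lock, release_lock and LockTimeout are hypothetical. It is not
Mercurial's own locking code, and it makes no claim to be safe over
NFS.</para>

```python
import os
import time


class LockTimeout(Exception):
    """Raised when the lock could not be acquired before the deadline."""


def acquire_lock(path, timeout=5.0, poll=0.1):
    """Create the lock file at `path`, retrying until `timeout` seconds pass.

    Hypothetical helper for illustration; not Mercurial's implementation.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the lock file already exists,
            # so only one writer can succeed at a time.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            # Someone else holds the lock: give up once the deadline passes,
            # otherwise sleep briefly and retry.
            if time.monotonic() >= deadline:
                raise LockTimeout(path)
            time.sleep(poll)
        else:
            # Record the holder's process ID to help with debugging.
            os.write(fd, str(os.getpid()).encode())
            os.close(fd)
            return


def release_lock(path):
    """Remove the lock file so that a waiting writer can proceed."""
    os.unlink(path)
```

<para>A writer would call acquire_lock before touching the repository
and release_lock afterwards. A timeout of zero makes the writer fail
immediately if the lock is held, and a very large value makes it wait
more or less indefinitely.</para>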
638 | |
639 <sect3> | |
640 <title>Safe dirstate access</title> | |
641 | |
642 <para>As with revision data, Mercurial doesn't take a lock to | |
643 read the dirstate file; it does acquire a lock to write it. | |
644 To avoid the possibility of reading a partially written copy | |
645 of the dirstate file, Mercurial writes to a file with a | |
646 unique name in the same directory as the dirstate file, then | |
647 renames the temporary file atomically to | |
648 <filename>dirstate</filename>. The file named | |
649 <filename>dirstate</filename> is thus guaranteed to be | |
650 complete, not partially written.</para> | |
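<para>The write-then-rename pattern can be sketched in a few lines of
Python. This is a minimal illustration, not Mercurial's actual code;
the function name write_atomically is hypothetical, and the sketch
assumes a filesystem on which renaming within one directory is
atomic.</para>

```python
import os
import tempfile


def write_atomically(path, data):
    """Replace the file at `path` with `data` so readers never see a partial file.

    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file gets a unique name in the same directory as the
    # target, so the final rename never crosses a filesystem boundary.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # The rename swaps the complete new file into place in one step.
        os.replace(tmp, path)
    except BaseException:
        # On failure, clean up the temporary file and leave the old file alone.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

<para>A reader that opens the target at any moment sees either the
complete old contents or the complete new contents, never a
mixture.</para>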
651 | |
652 </sect3> | |
653 </sect2> | |
654 <sect2> | |
655 <title>Avoiding seeks</title> | |
656 | |
657 <para>Critical to Mercurial's performance is the avoidance of | |
658 seeks of the disk head, since any seek is far more expensive | |
659 than even a comparatively large read operation.</para> | |
660 | |
661 <para>This is why, for example, the dirstate is stored in a | |
662 single file. If there were a dirstate file per directory that | |
663 Mercurial tracked, the disk would seek once per directory. | |
664 Instead, Mercurial reads the entire single dirstate file in | |
665 one step.</para> | |
666 | |
667 <para>Mercurial also uses a <quote>copy on write</quote> scheme | |
668 when cloning a repository on local storage. Instead of | |
669 copying every revlog file from the old repository into the new | |
670 repository, it makes a <quote>hard link</quote>, which is a | |
671 shorthand way to say <quote>these two names point to the same | |
672 file</quote>. When Mercurial is about to write to one of a | |
673 revlog's files, it checks to see if the number of names | |
674 pointing at the file is greater than one. If it is, more than | |
675 one repository is using the file, so Mercurial makes a new | |
676 copy of the file that is private to this repository.</para> | |
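<para>The link-count check can be illustrated with a short sketch. It
assumes a POSIX-like filesystem that supports hard links; the function
name ensure_private_copy is hypothetical, and the code is not taken
from Mercurial.</para>

```python
import os
import shutil
import tempfile


def ensure_private_copy(path):
    """Give `path` its own copy of the data if other names share it.

    Returns True if a private copy was made, False if the file was
    already private. Hypothetical helper for illustration only.
    """
    # st_nlink is the number of names that point at this file.
    if os.stat(path).st_nlink > 1:
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        os.close(fd)
        # Copy contents and metadata, then swap the copy in under the old
        # name. The other names keep pointing at the original file.
        shutil.copy2(path, tmp)
        os.replace(tmp, path)
        return True
    return False
```

<para>Calling this helper before a write means the cost of copying is
paid only for files that are actually shared and actually about to
change.</para>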
677 | |
678 <para>A few revision control developers have pointed out that | |
679 this idea of making a complete private copy of a file is not | |
680 very efficient in its use of storage. While this is true, | |
681 storage is cheap, and this method gives the highest | |
682 performance while deferring most book-keeping to the operating | |
683 system. An alternative scheme would most likely reduce | |
684 performance and increase the complexity of the software, each | |
685 of which is much more important to the <quote>feel</quote> of | |
686 day-to-day use.</para> | |
687 | |
688 </sect2> | |
689 <sect2> | |
690 <title>Other contents of the dirstate</title> | |
691 | |
692 <para>Because Mercurial doesn't force you to tell it when you're | |
693 modifying a file, it uses the dirstate to store some extra | |
694 information so it can determine efficiently whether you have | |
695 modified a file. For each file in the working directory, it | |
696 stores the time at which Mercurial itself last modified the | |
697 file, and the size of the file at that time.</para> | |
698 | |
699 <para>When you explicitly <command role="hg-cmd">hg | |
700 add</command>, <command role="hg-cmd">hg remove</command>, | |
701 <command role="hg-cmd">hg rename</command> or <command | |
702 role="hg-cmd">hg copy</command> files, Mercurial updates the | |
703 dirstate so that it knows what to do with those files when you | |
704 commit.</para> | |
705 | |
706 <para>When Mercurial is checking the states of files in the | |
707 working directory, it first compares a file's modification time with the | |
708 time recorded in the dirstate. If they match, the file is treated as unmodified. | |
709 If the file's size has changed, the file must have been | |
710 modified. If the modification time has changed, but the size | |
711 has not, only then does Mercurial need to read the actual | |
712 contents of the file to see if they've changed. Storing these | |
713 few extra pieces of information dramatically reduces the | |
714 amount of data that Mercurial needs to read, which yields | |
715 large performance improvements compared to other revision | |
716 control systems.</para> | |
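<para>The three-step check can be sketched as follows. This is an
illustration only: the names Entry, record and is_modified are
hypothetical, and the SHA-1 digest is an assumption that stands in for
comparing against the stored revision, which is what the final step
amounts to in practice.</para>

```python
import hashlib
import os
from typing import NamedTuple


class Entry(NamedTuple):
    """What this sketch remembers about one tracked file."""
    mtime_ns: int
    size: int
    digest: str  # stand-in for the contents of the stored revision


def record(path):
    """Capture the modification time, size and content digest of a file."""
    st = os.stat(path)
    with open(path, "rb") as f:
        digest = hashlib.sha1(f.read()).hexdigest()
    return Entry(st.st_mtime_ns, st.st_size, digest)


def is_modified(path, entry):
    """Decide whether a file changed, reading its contents only if necessary."""
    st = os.stat(path)
    if st.st_mtime_ns == entry.mtime_ns:
        return False  # same modification time: treated as unmodified
    if st.st_size != entry.size:
        return True   # different size: certainly modified
    # Time changed but size did not: only now read the file.
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest() != entry.digest
```

<para>In the common case, where most files are untouched, the check
costs one stat call per file and reads no file contents at all.</para>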
717 | |
718 </sect2> | |
719 </sect1> | |
720 </chapter> | |
721 | |
722 <!-- | |
723 local variables: | |
724 sgml-parent-document: ("00book.xml" "book" "chapter") | |
725 end: | |
726 --> |