1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
|
|
2 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
|
|
3 <!ENTITY html2db "<code>html2db.xsl</code>">
|
|
4 ]>
|
|
5 <html xmlns:x="http://www.w3.org/1999/xhtml"
|
|
6 xmlns:db="urn:docbook">
|
|
7 <head>
|
|
8 <title>This title is ignored</title>
|
|
9 </head>
|
|
10 <body>
|
|
11
|
|
12 <h1>html2db.xsl</h1>
|
|
13
|
|
14 <!-- The xmlns attribute escapes into the Docbook namespace -->
|
|
15 <articleinfo xmlns="urn:docbook">
|
|
16 <author>
|
|
17 <firstname>Oliver</firstname>
|
|
18 <surname>Steele</surname>
|
|
19 </author>
|
|
20 <revhistory>
|
|
21 <revision>
|
|
22 <revnumber>1</revnumber>
|
|
23 <date>2004-07-30</date>
|
|
24 </revision>
|
|
25 <revision>
|
|
26 <revnumber>1.0.1</revnumber>
|
|
27 <date>2004-08-01</date>
|
|
28 <revdescription><para>Editorial changes to the
|
|
29 readme.</para></revdescription>
|
|
30 </revision>
|
|
31 </revhistory>
|
|
32 <date>2004-07-30</date>
|
|
33 </articleinfo>
|
|
34
|
|
35 <h2>Overview</h2>
|
|
36
|
|
37 <p>&html2db; converts an XHTML source document into a Docbook output
|
|
38 document. It provides features for customizing the generation of the
|
|
39 output, so that the output can be tuned by annotating
|
|
40 the source, rather than hand-editing the output. This makes it useful
|
|
41 in a processing pipeline where the source documents are maintained in
|
|
42 HTML, although it can be used as a one-time conversion tool
|
|
43 too.</p>
|
|
44
|
|
45 <p>This document is an example of &html2db; used in conjunction with
|
|
46 the Docbook XSL stylesheets. The <a href="index.src.html">source
|
|
47 file</a> is an XHTML file with some embedded Docbook elements and
|
|
48 processing instructions. &html2db; compiles it into a <a
|
|
49 href="index.xml">Docbook document</a>, which can be used to generate
|
|
50 this output file (which includes a Table of Contents), a <a
|
|
51 href="docs/index.html">chunked HTML file</a>, a <a
|
|
52 href="html2db.pdf">PDF</a>, or other formats.</p>
|
|
53
|
|
54 <h2>Features</h2>
|
|
55 <dl>
|
|
56 <dt>XSLT implementation</dt>
|
|
57 <dd>This tool is designed to be embedded within an XSLT processing
|
|
58 pipeline. <code>html2db.xsl</code> can be used in a custom
|
|
59 stylesheet or integrated into a larger system. See <a
|
|
60 href="#embedding">Overriding</a>.</dd>
|
|
61
|
|
62 <dt>Customizable</dt>
|
|
63 <dd>The output can be customized by means of additional markup in
|
|
64 the XHTML source. See the section on <a
|
|
65 href="#customization">customization</a>.</dd>
|
|
66
|
|
67 <dt>Creates outline structure</dt>
|
|
68 <dd><code>h1</code>, <code>h2</code>, etc. are turned into nested
|
|
69 <code>section</code> and <code>title</code> elements (as opposed to
|
|
70 bridge heads).</dd>
|
|
71
|
|
72 <dt>Accepts a wide variety of XHTML</dt>
|
|
73 <dd>In particular, &html2db; automatically wraps <dfn>naked item
|
|
74 text</dfn> (text that is not enclosed in a <code><p></code>)
|
|
75 inside a table cell or list item. Naked text is a common property of
|
|
76 XHTML documents, but needs to be clothed to create valid
|
|
77 Docbook.<db:footnote><p>This feature is limited. See <a
|
|
78 href="#implicit-blocks">Implicit Blocks</a>.</p></db:footnote></dd>
|
|
79
|
|
80 </dl>
|
|
81
|
|
82 <h2>Requirements</h2>
|
|
83 <ul>
|
|
84 <li>Java: JRE or JDK 1.3 or greater.</li>
|
|
85 <li>Xalan 2.5.0.</li>
|
|
86 <li>Familiarity with installing and running JAR files.</li>
|
|
87 </ul>
|
|
88
|
|
89 <p>&html2db; might work with earlier versions of Java and Xalan, and
|
|
90 it might work with other XSLT processors such as Saxon and
|
|
91 xsltproc.</p>
|
|
92
|
|
93 <h2>License</h2>
|
|
94 <p>This software is released under the Open Source <a href="http://www.opensource.org/licenses/artistic-license.php">Artistic License</a>.</p>
|
|
95
|
|
96 <h2>Installation</h2>
|
|
97 <ul>
|
|
98 <li>Install JRE 1.3 or higher.</li>
|
|
99 <li>Install Xalan, if necessary.</li>
|
|
100 <li>Download <code>html2db-1.zip</code> from <a href="http://osteele.com/sources/html2db.zip">http://osteele.com/sources/html2db-1.zip</a>.</li>
|
|
101 <li>Unzip <code>html2db-1.zip</code>.</li>
|
|
102 </ul>
|
|
103
|
|
104 <h2>Usage</h2>
|
|
105 <p>Use Xalan to process an XHTML source file into a Docbook file:</p>
|
|
106
|
|
107 <pre class="example">
|
|
108 java org.apache.xalan.xslt.Process -XSL html2db.xsl -IN doc.html > doc.xml
|
|
109 </pre>
|
|
110
|
|
111 <p>See <a href="index.src.html"><code>index.src.html</code></a> for an
|
|
112 example of an input file.</p>
|
|
113
|
|
114 <p>If your source files are in HTML, not XHTML, you may find the <a
|
|
115 href="http://tidy.sourceforge.net/">Tidy</a> tool useful. This is a
|
|
116 tool that converts from HTML to XHTML, and can be added to the front
|
|
117 of your processing pipeline.</p>
|
|
118
|
|
119 <p>(If you need to process HTML and you don't know or can't figure out
|
|
120 from context what a processing pipeline is, &html2db; is probably not
|
|
121 the right tool for you, and you should look for a local XML or Java
|
|
122 guru or for a commercially supported product.)</p>
|
|
123
|
|
124 <h2>Specification</h2>
|
|
125
|
|
126 <h3>XHTML Elements</h3>
|
|
127 <p><code>code/i</code> stands for "an <code>i</code> element
|
|
128 immediately within a <code>code</code> element". This notation is
|
|
129 from XPath.</p>
|
|
130
|
|
131 <p>XHTML elements must be in the XHTML namespace,
|
|
132 <code>http://www.w3.org/1999/xhtml</code>.</p>
|
|
133
|
|
134 <table>
|
|
135 <tr>
|
|
136 <th>XHTML</th>
|
|
137 <th>Docbook</th>
|
|
138 <th>Notes</th>
|
|
139 </tr>
|
|
140
|
|
141 <tr>
|
|
142 <td><code>b</code>, <code>i</code>, <code>em</code>, <code>strong</code></td>
|
|
143 <td><code>emphasis</code></td>
|
|
144 <td>The <code>role</code> attribute is the original tag name</td>
|
|
145 </tr>
|
|
146
|
|
147 <tr>
|
|
148 <td><code>dfn</code></td>
|
|
149 <td><code>glossterm</code>, and also <code>primary</code> <code>indexterm</code></td>
|
|
150 </tr>
|
|
151
|
|
152 <tr>
|
|
153 <td><code>code/i</code>, <code>tt/i</code>, <code>pre/i</code></td>
|
|
154 <td><code>replaceable</code></td>
|
|
155 <td>In practice, <code>i</code> within monospace content is usually used to mean replaceable text. If you're using it for emphasis, use <code>em</code> instead.</td>
|
|
156 </tr>
|
|
157
|
|
158 <tr>
|
|
159 <td><code>pre</code>, <code>body/code</code></td>
|
|
160 <td><code>programlisting</code></td>
|
|
161 </tr>
|
|
162
|
|
163 <tr>
|
|
164 <td><code>img</code></td>
|
|
165 <td><code>inlinemediaobject/imageobject/imagedata</code></td>
|
|
166 <td>In an inline context.</td>
|
|
167 </tr>
|
|
168
|
|
169 <tr>
|
|
170 <td><code>img</code></td>
|
|
171 <td><code>[informal]figure/mediaobject/imageobject/imagedata</code></td>
|
|
172 <td>If it has a <code>title</code> attribute or <code>db:title</code> it's wrapped in a <code>figure</code>. Otherwise it's wrapped in an <code>informalfigure</code>.</td>
|
|
173 </tr>
|
|
174
|
|
175 <tr>
|
|
176 <td><code>table</code></td>
|
|
177 <td><code>[informal]table</code></td>
|
|
178 <td>XHTML <code>table</code> becomes Docbook <code>table</code> if it has a <code>summary</code> attribute; <code>informaltable</code> otherwise.</td>
|
|
179 </tr>
|
|
180
|
|
181 <tr>
|
|
182 <td><code>ul</code></td>
|
|
183 <td><code>itemizedlist</code></td>
|
|
184 <td>But see the processing instruction <a href="#simplelist">below</a>.</td>
|
|
185 </tr>
|
|
186 </table>
|
|
187
|
|
188
|
|
189
|
|
190 <h3>Links</h3>
|
|
191 <table summary="Link Translation">
|
|
192 <tr>
|
|
193 <th>XHTML</th>
|
|
194 <th>Docbook</th>
|
|
195 <th>Notes</th>
|
|
196 </tr>
|
|
197
|
|
198 <tr>
|
|
199 <td><code><a name="<var>name</var>"></code></td>
|
|
200 <td><code><anchor id="{$anchor-id-prefix}<var>name</var>"></code></td>
|
|
201 <td>An anchor within a <code>h<var>n</var></code> element is attached to the enclosing <code>section</code> as an <code>id</code> attribute instead.</td>
|
|
202 </tr>
|
|
203
|
|
204 <tr>
|
|
205 <td><code><a href="#<var>name</var>"></code></td>
|
|
206 <td><code><link linkend="{$anchor-id-prefix}<var>name</var>"></code></td>
|
|
207 </tr>
|
|
208
|
|
209 <tr>
|
|
210 <td><code><a href="<var>url</var>"></code></td>
|
|
211 <td><code><ulink url="<var>url</var>"></code></td>
|
|
212 </tr>
|
|
213
|
|
214 <tr>
|
|
215 <td><code><a href="mailto:<var>address</var>"></code></td>
|
|
216 <td><code><email><var>address</var></email></code></td>
|
|
217 </tr>
|
|
218
|
|
219 </table>
|
|
220
|
|
221 <h3 id="tables">Tables</h3>
|
|
222
|
|
223 <p>XHTML <code>table</code> support is minimal. &html2db; changes the
|
|
224 element names and counts the columns (this is necessary to get table
|
|
225 footnotes to span all the columns), but it does not attempt to deal
|
|
226 with tables in their full generality.</p>
|
|
227
|
|
228 <p>An XHTML <code>table</code> with a <code>summary</code> attribute
|
|
229 generates a <code>table</code>, whose <code>title</code> is the value
|
|
230 of that summary. An XHTML <code>table</code> without a
|
|
231 <code>summary</code> generates an <code>informaltable</code>.</p>
|
|
232
|
|
233 <p>Any <code>tr</code>s that contain <code>th</code>s are pulled to
|
|
234 the top of the table, and placed inside a <code>thead</code>. Other
|
|
235 <code>tr</code>s are placed inside a <code>tbody</code>. This matches
|
|
236 the common XHTML <code>table</code> pattern, where the first row is
|
|
237 a header row.</p>
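
<p>As a rough sketch of the rules above, a source table like the first listing would be expected to produce something like the second. The <code>tgroup</code>, <code>row</code>, and <code>entry</code> names and the <code>cols</code> value are assumed from standard Docbook table markup rather than copied from actual output, and cell text may additionally be wrapped in <code>para</code>.</p>

<pre class="example"><![CDATA[
<!-- XHTML source: the summary attribute requests a titled table -->
<table summary="Supported formats">
  <tr><th>Format</th><th>Status</th></tr>
  <tr><td>PDF</td><td>Supported</td></tr>
</table>

<!-- Approximate Docbook result (a sketch, not verified output) -->
<table>
  <title>Supported formats</title>
  <tgroup cols="2">
    <thead>
      <row><entry>Format</entry><entry>Status</entry></row>
    </thead>
    <tbody>
      <row><entry>PDF</entry><entry>Supported</entry></row>
    </tbody>
  </tgroup>
</table>
]]></pre>

<p>Dropping the <code>summary</code> attribute from the source should yield an <code>informaltable</code> with no <code>title</code>.</p>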
|
|
238
|
|
239 <h3 id="implicit-blocks">Implicit Blocks</h3>
|
|
240 <p>XHTML allows <code>li</code>, <code>dd</code>, and <code>td</code>
|
|
241 elements to contain either inline text (for instance,
|
|
242 <code><li>a list item</li></code>) or block structure
|
|
243 (<code><li><p>a block</p></li></code>). The
|
|
244 corresponding Docbook elements require block structure, such as
|
|
245 <code>para</code>.</p>
|
|
246
|
|
247 <p>&html2db; provides limited support for wrapping naked text in
|
|
248 these positions in <code>para</code> elements. If a list item or
|
|
249 table cell item directly contains text, all text up to the position of
|
|
250 the first element (or all text, if there is no element) is wrapped in
|
|
251 <code>para</code>. This handles the simple case of an item that
|
|
252 directly contains text, and also the case of an item that contains
|
|
253 text followed by blocks such as paragraphs.</p>
|
|
254
|
|
255 <p>Note that this algorithm is easily confused. It doesn't
|
|
256 distinguish between block and inline XHTML elements, so it will only
|
|
257 wrap the first word in <code><li>some <b>bold</b>
|
|
258 text</li></code>, leading to badly formatted output. The
|
|
259 workaround is to wrap troublesome content in explicit
|
|
260 <code><p></code> tags.</p>
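
<p>A small illustration of the workaround follows. The mapping of <code>li</code> to <code>listitem</code> and of <code>p</code> to <code>para</code> is assumed here, since the element tables above do not list them; the <code>role</code> value follows the rule for <code>emphasis</code>.</p>

<pre class="example"><![CDATA[
<!-- Easily confused: only the text before the first element is wrapped -->
<li>some <b>bold</b> text</li>

<!-- Workaround: an explicit paragraph keeps the whole sentence together -->
<li><p>some <b>bold</b> text</p></li>

<!-- Expected Docbook for the workaround (a sketch; element names assumed) -->
<listitem><para>some <emphasis role="b">bold</emphasis> text</para></listitem>
]]></pre>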
|
|
261
|
|
262 <h3 id="docbook-elements">Docbook Elements</h3>
|
|
263
|
|
264 <p>Elements from the Docbook namespace are passed through as is.
|
|
265 There are two ways to include a Docbook element in your XHTML
|
|
266 source:</p>
|
|
267
|
|
268 <dl>
|
|
269 <dt>Global prefix</dt>
|
|
270 <dd><p>A <dfn>fake Docbook namespace</dfn><db:footnote><p>The fake
|
|
271 Docbook namespace is <code>urn:docbook</code>. Docbook doesn't really
|
|
272 have a namespace, and if it did, it wouldn't be this one. See <a
|
|
273 href="#docbook-namespace">Docbook namespace</a> for a discussion of
|
|
274 this issue.</p></db:footnote>
|
|
275
|
|
276 declaration may be added to the document root element. Anywhere in
|
|
277 the document, the prefix from this namespace declaration may be used
|
|
278 to include a Docbook element. This is useful if a document contains
|
|
279 many Docbook elements, such as <code>footnote</code> or
|
|
280 <code>glossterm</code>, interspersed with XHTML. (In this case it may
|
|
281 be more convenient to allow these elements in the XHTML namespace and
|
|
282 add a customization layer that translates them to Docbook elements,
|
|
283 however. See <a href="#customization">Customization</a>.)</p>
|
|
284
|
|
285 <pre class="example"><![CDATA[
|
|
286 <html xmlns="http://www.w3.org/1999/xhtml"
|
|
287 xmlns:db="urn:docbook">
|
|
288 ...
|
|
289 <p>Some text<db:footnote>and a footnote</db:footnote>.</p>
|
|
290 ]]></pre></dd>
|
|
291
|
|
292 <dt>Local namespace</dt>
|
|
293 <dd><p>A Docbook element may be introduced along with a prefix-less
|
|
294 namespace declaration. This is useful for embedding a Docbook
|
|
295 document fragment (a hierarchy of elements that all use Docbook tags)
|
|
296 within an XHTML document.</p>
|
|
297
|
|
298 <pre class="example"><![CDATA[
|
|
299 ...
|
|
300 <articleinfo xmlns="urn:docbook">
|
|
301 <author>
|
|
302 <firstname>...</firstname>
|
|
303 ...
|
|
304 ]]></pre></dd>
|
|
305 </dl>
|
|
306
|
|
307 <p>The source to <a href="index.src.html">this document</a>
|
|
308 illustrates both of these techniques.</p>
|
|
309
|
|
310 <p class="note">Both these techniques will cause your document to be
|
|
311 invalid as XHTML. In order to validate an XHTML document that
|
|
312 contains Docbook elements, you will need to create a custom schema.
|
|
313 Technically, you then ought to place your document in a different
|
|
314 namespace, but this will cause &html2db; not to recognize it!</p>
|
|
315
|
|
316
|
|
317 <h3>Output Processing Instructions</h3>
|
|
318
|
|
319 <p>&html2db; adds a few processing instructions to the output file.
|
|
320 The Docbook XSL stylesheets ignore these, but if you write a
|
|
321 customization layer for Docbook XSL, you can use the information in
|
|
322 these processing instructions to customize the HTML output. This can
|
|
323 be used, for example, to set the <code>a</code> <code>onclick</code>
|
|
324 and <code>target</code> attributes in the HTML files that Docbook XSL
|
|
325 creates to the same values they had in the input document.</p>
|
|
326
|
|
327 <dl>
|
|
328 <dt><code><?html2db attribute="<var>name</var>" value="<var>value</var>"?></code></dt>
|
|
329 <dd>Placed inside a link element to capture the value of the <code>a</code> <code>target</code> and <code>onclick</code> attributes. <var>name</var> is the name of the attribute (<code>target</code> or <code>onclick</code>), and <var>value</var> is its value, with <code>"</code> and <code>\</code> replaced by <code>\"</code> and <code>\\</code>, respectively.</dd>
|
|
330
|
|
331 <dt><code><?html2db element="br"?></code></dt>
|
|
332 <dd>Represents the location of an XHTML <code>br</code> element in the
|
|
333 source document.</dd>
|
|
334
|
|
335 </dl>
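
<p>As an illustration only, a link with a <code>target</code> attribute and a line break would be expected to appear in the Docbook output roughly as shown below. The placement of the instruction before the link text is an assumption, and the URL is a placeholder.</p>

<pre class="example"><![CDATA[
<!-- XHTML source -->
<p>See the <a href="http://example.com/" target="_blank">example site</a>.<br/>
More text.</p>

<!-- Approximate Docbook output (a sketch, not verified output) -->
<para>See the <ulink url="http://example.com/"><?html2db attribute="target" value="_blank"?>example site</ulink>.<?html2db element="br"?>
More text.</para>
]]></pre>

<p>A Docbook XSL customization layer could then match these processing instructions to restore the original attributes in the generated HTML.</p>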
|
|
336
|
|
337 <p>You can also include <code><?db2html?></code> processing
|
|
338 instructions in the HTML source document, and they will be copied
|
|
339 through to the Docbook output file unchanged (as will all other
|
|
340 processing instructions).</p>
|
|
341
|
|
342
|
|
343 <h2 id="customization">Customization</h2>
|
|
344 <h3>XSLT Parameters</h3>
|
|
345 <dl>
|
|
346 <dt><code><xsl:param name="anchor-id-prefix" select="''"/></code></dt>
|
|
347 <dd>Prefixed to every id generated from <code><a name=></code>
|
|
348 and <code><a href="#"></code>. This is useful to avoid
|
|
349 collisions between multiple documents that are compiled into the
|
|
350 same book. For instance, if a number of XHTML sources are assembled
|
|
351 into chapters of a book, you can style each source file with a prefix of
|
|
352 <code><var>docid</var>.</code> where <var>docid</var> is a unique id
|
|
353 for each source file.</dd>
|
|
354
|
|
355 <dt><code><xsl:param name="document-root" select="'article'"/></code></dt>
|
|
356 <dd>The default document root. This can be overridden by
|
|
357 <code><?html2db class="<var>name</var>"?></code> within the
|
|
358 document itself, and defaults to <code>article</code>.</dd>
|
|
359 </dl>
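
<p>One way to set these parameters is from a small wrapper stylesheet that imports &html2db;. This is a sketch: the file name <code>install-chapter.xsl</code> and the prefix <code>install.</code> are hypothetical, chosen only for illustration.</p>

<pre class="example"><![CDATA[
<?xml version="1.0"?>
<!-- install-chapter.xsl (hypothetical name): wrapper that presets parameters -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- xsl:import must come before any other top-level element -->
  <xsl:import href="html2db.xsl"/>
  <!-- Prefix every generated id so that chapters do not collide -->
  <xsl:param name="anchor-id-prefix" select="'install.'"/>
  <!-- Emit a chapter instead of the default article -->
  <xsl:param name="document-root" select="'chapter'"/>
</xsl:stylesheet>
]]></pre>

<p>With this wrapper, an anchor named <code>setup</code> in the source should produce the id <code>install.setup</code>, following the link translation rules above.</p>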
|
|
360
|
|
361 <h3 id="processing-instructions">Processing instructions</h3>
|
|
362 <p>Use the <code><?html2db?></code> processing instruction to
|
|
363 customize the transformation of the XHTML source to Docbook:</p>
|
|
364
|
|
365 <table>
|
|
366 <tr>
|
|
367 <th>Processing instruction</th>
|
|
368 <th>Content</th>
|
|
369 <th>Effect</th>
|
|
370 </tr>
|
|
371
|
|
372 <tr>
|
|
373 <td><code><?html2db class="<var>xxx</var>"?></code></td>
|
|
374 <td><code>body</code></td>
|
|
375 <td>Sets the output document root to <var>xxx</var>. Useful for
|
|
376 translating to <code>preface</code>, <code>appendix</code>, or <code>chapter</code>; the default is
|
|
377 <var>$document-root</var>.</td>
|
|
378 </tr>
|
|
379
|
|
380 <tr id="simplelist">
|
|
381 <td><code><?html2db class="simplelist"?></code></td>
|
|
382 <td><code>ul</code></td>
|
|
383 <td>Creates a vertical <code>simplelist</code>.<db:footnote><db:para>Note that the
|
|
384 current implementation simply checks for the presence of <em>any</em>
|
|
385 <code>html2db</code> processing instruction.</db:para></db:footnote></td>
|
|
386 </tr>
|
|
387
|
|
388
|
|
389 <tr>
|
|
390 <td><code><?html2db rowsep="1"?></code></td>
|
|
391 <td><code>[informal]table</code></td>
|
|
392 <td>Sets the <code>rowsep</code> attribute on the generated <code>table</code>.<db:footnote><db:para>Note that the current implementation simply checks for the presence of <em>any</em> <code>html2db</code> processing instruction that begins with <code>rowsep</code>, and assumes the value is <code>1</code>.</db:para></db:footnote></td>
|
|
393 </tr>
|
|
394 </table>
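
<p>A sketch of how these instructions might appear in a source file. Placing each instruction as the first child of the element it affects is an assumption; the table above only says which element contains it. The spacing inside each instruction is written exactly as in the table, because the matching is literal.</p>

<pre class="example"><![CDATA[
<body>
  <!-- Emit a chapter instead of the default document root -->
  <?html2db class="chapter"?>
  <h1>Installation</h1>

  <ul>
    <!-- Request a simplelist instead of an itemizedlist -->
    <?html2db class="simplelist"?>
    <li>Linux</li>
    <li>Windows</li>
  </ul>

  <table>
    <!-- Request row separators on the generated table -->
    <?html2db rowsep="1"?>
    <tr><th>Platform</th><th>Notes</th></tr>
    <tr><td>Linux</td><td>Tested</td></tr>
  </table>
</body>
]]></pre>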
|
|
395
|
|
396 <h3 id="embedding">Overriding the built-in templates</h3>
|
|
397 <p>For cases where the previous techniques don't allow for enough
|
|
398 customization, you can override the built-in templates. You will need
|
|
399 to know XSLT in order to do this, and you will need to write a new
|
|
400 stylesheet that uses the <code>xsl:import</code> element to import
|
|
401 <code>html2db.xsl</code>.</p>
|
|
402
|
|
403 <p>The <a href="examples.xsl"><code>example.xsl</code></a> stylesheet
|
|
404 is an example customization layer. It recognizes the <code><div
|
|
405 class="abstract"></code> and <code><p class="note"></code>
|
|
406 classes in the <a href="index.src.html">source</a> for this document,
|
|
407 and generates the corresponding Docbook elements.</p>
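
<p>A minimal customization layer might look like the following. It is a sketch, not the contents of the example stylesheet: the <code>h</code> prefix and the choice of the Docbook <code>note</code> element are assumptions made for illustration, and it assumes the built-in templates are applied in the default mode.</p>

<pre class="example"><![CDATA[
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="h">
  <!-- Imported templates have lower precedence than the ones below -->
  <xsl:import href="html2db.xsl"/>

  <!-- Assumed mapping: a paragraph marked class="note" becomes a Docbook note -->
  <xsl:template match="h:p[@class='note']">
    <note>
      <para><xsl:apply-templates/></para>
    </note>
  </xsl:template>
</xsl:stylesheet>
]]></pre>

<p>Run this stylesheet in place of &html2db;; anything it does not match falls through to the imported templates.</p>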
|
|
408
|
|
409
|
|
410 <h2>FAQ</h2>
|
|
411 <h3>Why generate Docbook?</h3>
|
|
412 <p>The primary reason to use Docbook as an <em>output</em> format is
|
|
413 to take advantage of the Docbook XSL stylesheets. These are a
|
|
414 well-designed, well-documented set of XSL stylesheets that provide a
|
|
415 variety of publishing features that would be difficult to recreate
|
|
416 from scratch for HTML:</p>
|
|
417
|
|
418 <ul>
|
|
419 <li>Automatic Table-of-Contents generation</li>
|
|
420 <li>Automatic part, chapter, and section numbering.</li>
|
|
421 <li>Creation of single-page, multi-page, PDF, and WinHelp files from the same source document.</li>
|
|
422 <li>Navigation headers, footers, and metadata for multi-page HTML
|
|
423 documents.</li>
|
|
424 <li>Link resolution and link target text insertion across multiple pages and numbered targets.</li>
|
|
425 <li>Figure, example, and table numbering, and tables of these.</li>
|
|
426 <li>Index and glossary tools.</li>
|
|
427 </ul>
|
|
428
|
|
429 <h3>Why write in XHTML?</h3>
|
|
430
|
|
431 <p>Given that Docbook is so great, why not write in it?</p>
|
|
432
|
|
433 <p>Where there are not legacy concerns, Docbook is probably a better
|
|
434 choice for structured or technical documentation.</p>
|
|
435
|
|
436 <p>Where the only legacy concern is the documents themselves, and not
|
|
437 the tools and skill sets of documentation contributors, you should
|
|
438 consider using an (X)HTML convertor to perform a one-time conversion
|
|
439 of your documentation source into Docbook, and then switching
|
|
440 development to the result files. You can use this stylesheet to
|
|
441 perform this conversion, or evaluate other tools, many of which are
|
|
442 probably appropriate for this purpose.</p>
|
|
443
|
|
444 <p>Often there are other legacy concerns: the availability of cheap
|
|
445 (including free) and usable HTML editors and editing modes; and the
|
|
446 fact that it's easier to teach people XHTML than Docbook. If either
|
|
447 of these is an issue in your organization, you may want to maintain
|
|
448 documentation sources in XHTML instead of Docbook.</p>
|
|
449
|
|
450 <p>For example, at <a href="http://www.laszlosystems.com/">Laszlo</a>,
|
|
451 most developers contribute directly to the documentation. Requiring
|
|
452 that developers learn Docbook, or that they wait on the doc team to
|
|
453 get content into the docs, would discourage this.</p>
|
|
454
|
|
455 <h3>Why not use an existing convertor?</h3>
|
|
456
|
|
457 <p>This isn't the first (X)HTML to Docbook convertor. Why not use one
|
|
458 of the existing ones?</p>
|
|
459
|
|
460 <p>Each HTML to Docbook convertor that I could find had at least some
|
|
461 of the following limitations, some of which stemmed from their
|
|
462 intended use as one-time-only convertors for legacy documents:</p>
|
|
463
|
|
464 <ul>
|
|
465 <li>Many only operated on a subset of HTML, and relied upon hand
|
|
466 editing of the output to clean up mistakes. This made them impossible
|
|
467 to use as part of a processing pipeline, where the source is
|
|
468 <em>maintained</em> in XHTML.</li>
|
|
469
|
|
470 <li>There was no way to customize the output, except by (1) hand
|
|
471 editing, or (2) writing a post-processing stylesheet, which didn't
|
|
472 have access to the information in the XHTML source document.</li>
|
|
473
|
|
474 <li>Many of them were difficult or impossible to customize and
|
|
475 extend. They were closed-source, or written in Java or Perl (which I
|
|
476 find to be difficult languages to use for customizing this kind of
|
|
477 thing) and embedded in a larger system.</li>
|
|
478
|
|
479 <li>They didn't take full advantage of the Docbook tag set and content
|
|
480 model to represent document structure. For instance, they didn't
|
|
481 generate nested <code>section</code> elements to represent
|
|
482 <code>h1</code> <code>h2</code> sequences, or <code>table</code> to
|
|
483 represent tables with <code>summary</code> attributes.</li>
|
|
484 </ul>
|
|
485
|
|
486 <h3>I got this error. What does it mean?</h3>
|
|
487 <dl>
|
|
488 <dt>Q. <code>Fatal Error! The element type "br" must be terminated by the matching end-tag "</br>".
|
|
489 </code></dt>
|
|
490 <dd>A. Your document is HTML, not <em>X</em>HTML. You need to fix it, or run it through Tidy first.</dd>
|
|
491
|
|
492 <dt>Q. My output document is empty except for the <code><?xml version="1.0" encoding="UTF-8"?></code> line.</dt>
|
|
493 <dd>A. The document is missing a namespace declaration. See the <a href="index.src.html">example</a> for an example.</dd>
|
|
494
|
|
495 <dt>Q. Some of the headers and document sections are repeated multiple times.</dt>
|
|
496 <dd>A. The document has out-of-sequence headers, such as <code>h1</code> followed by <code>h3</code> (instead of <code>h2</code>). This won't work.</dd>
|
|
497
|
|
498 <dt>Q. <code>Fatal Error! The prefix "db" for element "db:footnote" is not bound.</code></dt>
|
|
499 <dd>A. You haven't declared the <code>db</code> namespace prefix. See the <a href="index.src.html">example</a> for an example.</dd>
|
|
500
|
|
501 </dl>
|
|
502
|
|
503
|
|
504 <h2>Implementation Notes</h2>
|
|
505
|
|
506 <h3>Bugs</h3>
|
|
507 <ul>
|
|
508 <li>Improperly sequenced <code>h<var>n</var></code> (for example
|
|
509 <code>h1</code> followed by <code>h3</code>, instead of
|
|
510 <code>h2</code>) will result in duplicate text.</li>
|
|
511 </ul>
|
|
512
|
|
513
|
|
514 <h3>Limitations</h3>
|
|
515 <ul>
|
|
516 <li>The <code>id</code> attribute is only preserved for certain
|
|
517 elements (at least <code>h<var>n</var></code>, images, paragraphs, and
|
|
518 tables). It ought to be preserved for all of them.</li>
|
|
519 <li>Only the <a href="#tables">very simplest</a> table format is
|
|
520 implemented.</li>
|
|
521 <li>Always uses compact lists.</li>
|
|
522 <li>The string matching for <code><?html2db
|
|
523 class="<var>classname</var>"?></code> requires an exact match
|
|
524 (spaces and all).</li>
|
|
525 <li>The <a href="#implicit-blocks">implicit blocks</a> code is easily
|
|
526 confused, as documented in that section. This is
|
|
527 easy to fix now that I understand the difference between block and
|
|
528 inline elements (I didn't when I was implementing this), but I
|
|
529 probably won't do so until I run into the problem again.</li>
|
|
530
|
|
531 </ul>
|
|
532
|
|
533
|
|
534
|
|
535
|
|
536 <h3>Wishlist</h3>
|
|
537 <ul>
|
|
538 <li>Allow <code><?html2db attribute-name="<var>name</var>"
|
|
539 value="<var>value</var>"?></code> at any position, to set arbitrary
|
|
540 Docbook attributes on the generated element.</li>
|
|
541
|
|
542 <li>Use a different technique from the <a href="#docbook-elements">fake
|
|
543 namespace prefix</a> to name Docbook elements in the source, that
|
|
544 preserves the XHTML validity of the source file. For example, an
|
|
545 option to transform <code><div class="db:footnote"></code> into
|
|
546 <code><footnote></code>, or to use a processing instruction
|
|
547 (<code><div><?html2db classname="footnote"?></code>).</li>
|
|
548
|
|
549 <li>Parse DC metadata from XHTML <code>html/head/meta</code>.</li>
|
|
550
|
|
551 <li>Add an option to use <code>html/head/title</code> instead of
|
|
552 <code>html/body/h1[1]</code> for top title.</li>
|
|
553
|
|
554 <li>Allow an <code>id</code> on every element.</li>
|
|
555
|
|
556 <li>Add an option to translate the XHTML <code>class</code> into a
|
|
557 Docbook <code>role</code>.</li>
|
|
558
|
|
559 <li>Preserve more of the whitespace from the source document, especially within lists and tables, in order to make it easier to debug the output document.</li>
|
|
560 </ul>
|
|
561 <h3>Support</h3>
|
|
562 <p>This is a work in progress. It serves my needs, but doesn't
|
|
563 attempt to be much more general than that. If you run into anything
|
|
564 it can't handle, please send a note, or better yet, a patch, to <a
|
|
565 href="mailto:steele@osteele.com">steele@osteele.com</a>. I can't
|
|
566 promise to address problems (I have a day job too), but knowing what
|
|
567 people have run into will help me prioritize my work when I do have
|
|
568 time to work on this.</p>
|
|
569
|
|
570
|
|
571
|
|
572
|
|
573
|
|
574 <h3>Design Notes</h3>
|
|
575 <h4 id="docbook-namespace">The Docbook Namespace</h4>
|
|
576 <p>&html2db; accepts elements in the "Docbook namespace" in XHTML
|
|
577 source. This namespace is <code>urn:docbook</code>.</p>
|
|
578
|
|
579 <p>This isn't technically correct. Docbook doesn't really have a
|
|
580 namespace, and if it did, it wouldn't be this one. <a
|
|
581 href="http://www.faqs.org/rfcs/rfc3151.html">RFC 3151</a> suggests
|
|
582 <code>urn:publicid:-:OASIS:DTD+DocBook+XML+V4.1.2:EN</code> as the
|
|
583 Docbook namespace.</p>
|
|
584
|
|
585 <p>There are two problems with the RFC 3151 namespace. First, it's long
|
|
586 and hard to remember. Second, it's limited to Docbook v4.1.2,
|
|
587 but &html2db; works with other versions of Docbook too, which would
|
|
588 presumably have other namespaces. I think it's more useful to
|
|
589 <em>under</em>specify the Docbook version in the spec for this tool.
|
|
590 Docbook itself underspecifies the version completely, by avoiding a
|
|
591 namespace at all, but when mixing Docbook and XHTML elements I find it
|
|
592 useful to be <em>more</em> specific than that.</p>
|
|
593
|
|
594 <h3>History</h3>
|
|
595 <p>The original version of &html2db; was written by <a
|
|
596 href="http://osteele.com">Oliver Steele</a>, as part of the <a
|
|
597 href="http://laszlosystems.com">Laszlo Systems, Inc.</a> documentation
|
|
598 effort. We had a set of custom stylesheets that formatted and added
|
|
599 linking information to programming-language elements such as
|
|
600 <code>classname</code> and <code>tagname</code>, and added
|
|
601 a Table of Contents to chapter documentation and numbers to examples.</p>
|
|
602
|
|
603 <p>As the documentation set grew, the doc team (John Sundman)
|
|
604 requested features such as inter-chapter navigation, callouts, and
|
|
605 index and glossary elements. I was able to beat all of these back
|
|
606 except for navigation, which seemed critical. After a few days trying
|
|
607 to implement this, I decided it would be simpler to convert the subset
|
|
608 of XHTML that we used into a subset of Docbook, and use the latter to
|
|
609 add navigation. (Once this was done, the other features came for
|
|
610 free.)</p>
|
|
611
|
|
612 <p>During my August 2004 "sabbatical", I factored the general html2db
|
|
613 code out from the Laszlo-specific code, refactored and otherwise
|
|
614 cleaned it up, and wrote this documentation.</p>
|
|
615
|
|
616 <h3>Credits</h3>
|
|
617 <p>&html2db; was written by <a href="http://osteele.com">Oliver Steele</a>, as part of the <a href="http://laszlosystems.com">Laszlo Systems, Inc.</a> documentation effort.</p>
|
|
618
|
|
619 </body>
|
|
620 </html>