comparison doc/wiki2docbook/html2db/index.xml @ 1734:b92fc3c922ac

scripts for converting wiki documentation to docbook
author nadvornik
date Sun, 22 Nov 2009 09:12:22 +0000
1733:91a65afb5d77 1734:b92fc3c922ac
1 <?xml version="1.0" encoding="UTF-8"?>
2 <article>
3
4 <title>html2db.xsl</title>
5
6
7 <articleinfo>
8 <author>
9 <firstname>Oliver</firstname>
10 <surname>Steele</surname>
11 </author>
12 <revhistory>
13 <revision>
14 <revnumber>1</revnumber>
15 <date>2004-07-30</date>
16 </revision>
17 <revision>
18 <revnumber>1.0.1</revnumber>
19 <date>2004-08-01</date>
20 <revdescription><para>Editorial changes to the
21 readme.</para></revdescription>
22 </revision>
23 </revhistory>
24 <date>2004-07-30</date>
25 </articleinfo>
26
27 <para/><section><title>Overview</title>
28
29 <para><literal>html2db.xsl</literal> converts an XHTML source document into a Docbook output
30 document. It provides features for customizing the generation of the
31 output, so that the output can be tuned by annotating
32 the source, rather than hand-editing the output. This makes it useful
33 in a processing pipeline where the source documents are maintained in
34 HTML, although it can be used as a one-time conversion tool
35 too.</para>
36
37 <para>This document is an example of <literal>html2db.xsl</literal> used in conjunction with
38 the Docbook XSL stylesheets. The <ulink url="index.src.html">source
39 file</ulink> is an XHTML file with some embedded Docbook elements and
40 processing instructions. <literal>html2db.xsl</literal> compiles it into a <ulink url="index.xml">Docbook document</ulink>, which can be used to generate
41 this output file (which includes a Table of Contents), a <ulink url="docs/index.html">chunked HTML file</ulink>, a <ulink url="html2db.pdf">PDF</ulink>, or other formats.</para>
42
43 <para/></section><section><title>Features</title>
44 <variablelist><varlistentry><term>XSLT implementation</term><listitem><para>This tool is designed to be embedded within an XSLT processing
45 pipeline. <literal>html2db.xsl</literal> can be used in a custom
46 stylesheet or integrated into a larger system. See <link linkend="embedding">Overriding</link>.</para></listitem></varlistentry><varlistentry><term>Customizable</term><listitem><para>The output can be customized by means of additional markup in
47 the XHTML source. See the section on <link linkend="customization">customization</link>.</para></listitem></varlistentry><varlistentry><term>Creates outline structure</term><listitem><para><literal>h1</literal>, <literal>h2</literal>, etc. are turned into nested
48 <literal>section</literal> and <literal>title</literal> elements (as opposed to
49 bridge heads).</para></listitem></varlistentry><varlistentry><term>Accepts a wide variety of XHTML</term><listitem><para>In particular, <literal>html2db.xsl</literal> automatically wraps <indexterm significance="preferred"><primary>naked item
50 text</primary></indexterm><glossterm>naked item
51 text</glossterm> (text that is not enclosed in a <literal>&lt;p&gt;</literal>)
52 inside a table cell or list item. Naked text is a common property of
53 XHTML documents, but needs to be clothed to create valid
54 Docbook.<footnote><para>This feature is limited. See <link linkend="implicit-blocks">Implicit Blocks</link>.</para></footnote></para></listitem></varlistentry></variablelist>
55
56 <para/></section><section><title>Requirements</title>
57 <itemizedlist spacing="compact"><listitem><para>Java: JRE or JDK 1.3 or greater.</para></listitem><listitem><para>Xalan 2.5.0.</para></listitem><listitem><para>Familiarity with installing and running JAR files.</para></listitem></itemizedlist>
58
59 <para><literal>html2db.xsl</literal> might work with earlier versions of Java and Xalan, and
60 it might work with other XSLT processors such as Saxon and
61 xsltproc.</para>
62
63 <para/></section><section><title>License</title>
64 <para>This software is released under the Open Source <ulink url="http://www.opensource.org/licenses/artistic-license.php">Artistic License</ulink>.</para>
65
66 <para/></section><section><title>Installation</title>
67 <itemizedlist spacing="compact"><listitem><para>Install JRE 1.3 or higher.</para></listitem><listitem><para>Install Xalan, if necessary.</para></listitem><listitem><para>Download <literal>html2db-1.zip</literal> from <ulink url="http://osteele.com/sources/html2db.zip">http://osteele.com/sources/html2db-1.zip</ulink>.</para></listitem><listitem><para>Unzip <literal>html2db-1.zip</literal>.</para></listitem></itemizedlist>
68
69 <para/></section><section><title>Usage</title>
70 <para>Use Xalan to process an XHTML source file into a Docbook file:</para>
71
72 <informalexample><programlisting>
73 java org.apache.xalan.xslt.Process -XSL html2db.xsl -IN doc.html &gt; doc.xml
74 </programlisting></informalexample>
75
76 <para>See <ulink url="index.src.html"><literal>index.src.html</literal></ulink> for an
77 example of an input file.</para>
78
79 <para>If your source files are in HTML, not XHTML, you may find the <ulink url="http://tidy.sourceforge.net/">Tidy</ulink> tool useful. This is a
80 tool that converts from HTML to XHTML, and can be added to the front
81 of your processing pipeline.</para>
82
83 <para>(If you need to process HTML and you don't know or can't figure out
84 from context what a processing pipeline is, <literal>html2db.xsl</literal> is probably not
85 the right tool for you, and you should look for a local XML or Java
86 guru or for a commercially supported product.)</para>
87
88 <para/></section><section><title>Specification</title>
89
90 <para/><section><title>XHTML Elements</title>
91 <para><literal>code/i</literal> stands for "an <literal>i</literal> element
92 immediately within a <literal>code</literal> element". This notation is
93 from XPath.</para>
94
95 <para>XHTML elements must be in the XHTML namespace,
96 <literal>http://www.w3.org/1999/xhtml</literal>.</para>
97
98 <informaltable><tgroup cols="3"><thead><row><entry>XHTML</entry><entry>Docbook</entry><entry>Notes</entry></row>
99 </thead><tbody><row><entry><literal>b</literal>, <literal>i</literal>, <literal>em</literal>, <literal>strong</literal></entry><entry><literal>emphasis</literal></entry><entry>The <literal>role</literal> attribute is the original tag name</entry></row>
100 <row><entry><literal>dfn</literal></entry><entry><literal>glossterm</literal>, and also <literal>primary</literal> <literal>indexterm</literal></entry></row>
101 <row><entry><literal>code/i</literal>, <literal>tt/i</literal>, <literal>pre/i</literal></entry><entry><literal>replaceable</literal></entry><entry>In practice, <literal>i</literal> within monospace content is usually used to mean replaceable text. If you're using it for emphasis, use <literal>em</literal> instead.</entry></row>
102 <row><entry><literal>pre</literal>, <literal>body/code</literal></entry><entry><literal>programlisting</literal></entry></row>
103 <row><entry><literal>img</literal></entry><entry><literal>inlinemediaobject/imageobject/imagedata</literal></entry><entry>In an inline context.</entry></row>
104 <row><entry><literal>img</literal></entry><entry><literal>[informal]figure/mediaobject/imageobject/imagedata</literal></entry><entry>If it has a <literal>title</literal> attribute or <literal>db:title</literal> it's wrapped in a <literal>figure</literal>. Otherwise it's wrapped in an <literal>informalfigure</literal>.</entry></row>
105 <row><entry><literal>table</literal></entry><entry><literal>[informal]table</literal></entry><entry>XHTML <literal>table</literal> becomes Docbook <literal>table</literal> if it has a <literal>summary</literal> attribute; <literal>informaltable</literal> otherwise.</entry></row>
106 <row><entry><literal>ul</literal></entry><entry><literal>itemizedlist</literal></entry><entry>But see the processing instruction <link linkend="simplelist">below</link>.</entry></row>
107 </tbody></tgroup></informaltable>
108
109
110
111 <para/></section><section><title>Links</title>
112 <table><title>Link Translation</title><tgroup cols="3"><thead><row><entry>XHTML</entry><entry>Docbook</entry><entry>Notes</entry></row>
113 </thead><tbody><row><entry><literal>&lt;a name="<replaceable>name</replaceable>"&gt;</literal></entry><entry><literal>&lt;anchor id="{$anchor-id-prefix}<replaceable>name</replaceable>"&gt;</literal></entry><entry>An anchor within a <literal>h<replaceable>n</replaceable></literal> element is attached to the enclosing <literal>section</literal> as an <literal>id</literal> attribute instead.</entry></row>
114 <row><entry><literal>&lt;a href="#<replaceable>name</replaceable>"&gt;</literal></entry><entry><literal>&lt;link linkend="{$anchor-id-prefix}<replaceable>name</replaceable>"&gt;</literal></entry></row>
115 <row><entry><literal>&lt;a href="<replaceable>url</replaceable>"&gt;</literal></entry><entry><literal>&lt;ulink url="<replaceable>url</replaceable>"&gt;</literal></entry></row>
116 <row><entry><literal>&lt;a href="mailto:<replaceable>address</replaceable>"&gt;</literal></entry><entry><literal>&lt;email&gt;<replaceable>address</replaceable>&lt;/email&gt;</literal></entry></row>
117 </tbody></tgroup></table>
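
<para>As a hedged illustration of the table above, assuming the default empty <literal>anchor-id-prefix</literal> (the anchor name, URL, and link text are invented for this example, and the serialization of the real output may differ), the three links in this source fragment:</para>

<informalexample><programlisting>
&lt;p&gt;&lt;a name="install"/&gt;Installation notes.&lt;/p&gt;
&lt;p&gt;See &lt;a href="#install"&gt;the installation notes&lt;/a&gt;
or &lt;a href="http://osteele.com"&gt;the author's site&lt;/a&gt;.&lt;/p&gt;
</programlisting></informalexample>

<para>would be expected to translate to approximately:</para>

<informalexample><programlisting>
&lt;para&gt;&lt;anchor id="install"/&gt;Installation notes.&lt;/para&gt;
&lt;para&gt;See &lt;link linkend="install"&gt;the installation notes&lt;/link&gt;
or &lt;ulink url="http://osteele.com"&gt;the author's site&lt;/ulink&gt;.&lt;/para&gt;
</programlisting></informalexample>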
118
119 <para/></section><section id="tables"><title>Tables</title>
120
121 <para>XHTML <literal>table</literal> support is minimal. <literal>html2db.xsl</literal> changes the
122 element names and counts the columns (this is necessary to get table
123 footnotes to span all the columns), but it does not attempt to deal
124 with tables in their full generality.</para>
125
126 <para>An XHTML <literal>table</literal> with a <literal>summary</literal> attribute
127 generates a <literal>table</literal>, whose <literal>title</literal> is the value
128 of that summary. An XHTML <literal>table</literal> without a
129 <literal>summary</literal> generates an <literal>informaltable</literal>.</para>
130
131 <para>Any <literal>tr</literal>s that contain <literal>th</literal>s are pulled to
132 the top of the table, and placed inside a <literal>thead</literal>. Other
133 <literal>tr</literal>s are placed inside a <literal>tbody</literal>. This matches
134 the common XHTML <literal>table</literal> pattern, where the first row is
135 a header row.</para>
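
<para>The rules above can be sketched with a small example. The table content is invented for illustration, and the exact attributes and whitespace that <literal>html2db.xsl</literal> emits may differ. An XHTML table such as:</para>

<informalexample><programlisting>
&lt;table summary="Supported processors"&gt;
  &lt;tr&gt;&lt;th&gt;Processor&lt;/th&gt;&lt;th&gt;Status&lt;/th&gt;&lt;/tr&gt;
  &lt;tr&gt;&lt;td&gt;Xalan&lt;/td&gt;&lt;td&gt;Tested&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
</programlisting></informalexample>

<para>should come out roughly as follows: the <literal>summary</literal> becomes the <literal>title</literal>, the row containing <literal>th</literal> cells moves into <literal>thead</literal>, and the column count is recorded on <literal>tgroup</literal>:</para>

<informalexample><programlisting>
&lt;table&gt;&lt;title&gt;Supported processors&lt;/title&gt;
  &lt;tgroup cols="2"&gt;
    &lt;thead&gt;&lt;row&gt;&lt;entry&gt;Processor&lt;/entry&gt;&lt;entry&gt;Status&lt;/entry&gt;&lt;/row&gt;&lt;/thead&gt;
    &lt;tbody&gt;&lt;row&gt;&lt;entry&gt;Xalan&lt;/entry&gt;&lt;entry&gt;Tested&lt;/entry&gt;&lt;/row&gt;&lt;/tbody&gt;
  &lt;/tgroup&gt;
&lt;/table&gt;
</programlisting></informalexample>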
136
137 <para/></section><section id="implicit-blocks"><title>Implicit Blocks</title>
138 <para>XHTML allows <literal>li</literal>, <literal>dd</literal>, and <literal>td</literal>
139 elements to contain either inline text (for instance,
140 <literal>&lt;li&gt;a list item&lt;/li&gt;</literal>) or block structure
141 (<literal>&lt;li&gt;&lt;p&gt;a block&lt;/p&gt;&lt;/li&gt;</literal>). The
142 corresponding Docbook elements require block structure, such as
143 <literal>para</literal>.</para>
144
145 <para><literal>html2db.xsl</literal> provides limited support for wrapping naked text in
146 these positions in <literal>para</literal> elements. If a list item or
147 table cell item directly contains text, all text up to the position of
148 the first element (or all text, if there is no element) is wrapped in
149 <literal>para</literal>. This handles the simple case of an item that
150 directly contains text, and also the case of an item that contains
151 text followed by blocks such as paragraphs.</para>
152
153 <para>Note that this algorithm is easily confused. It doesn't
154 distinguish between block and inline XHTML elements, so it will only
155 wrap the first word in <literal>&lt;li&gt;some &lt;b&gt;bold&lt;/b&gt;
156 text&lt;/li&gt;</literal>, leading to badly formatted output. The
157 workaround is to wrap troublesome content in explicit
158 <literal>&lt;p&gt;</literal> tags.</para>
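
<para>For illustration (the item text is invented), the first list item below mixes naked text with an inline element and is likely to trip the wrapping logic, while the second applies the explicit-paragraph workaround:</para>

<informalexample><programlisting>
&lt;ul&gt;
  &lt;!-- risky: naked text followed by an inline element --&gt;
  &lt;li&gt;some &lt;b&gt;bold&lt;/b&gt; text&lt;/li&gt;
  &lt;!-- safer: the content is already wrapped in a block --&gt;
  &lt;li&gt;&lt;p&gt;some &lt;b&gt;bold&lt;/b&gt; text&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
</programlisting></informalexample>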
159
160 <para/></section><section id="docbook-elements"><title>Docbook Elements</title>
161
162 <para>Elements from the Docbook namespace are passed through as is.
163 There are two ways to include a Docbook element in your XHTML
164 source:</para>
165
166 <variablelist><varlistentry><term>Global prefix</term><listitem><para>A <indexterm significance="preferred"><primary>fake Docbook namespace</primary></indexterm><glossterm>fake Docbook namespace</glossterm><footnote><para>The fake
167 Docbook namespace is <literal>urn:docbook</literal>. Docbook doesn't really
168 have a namespace, and if it did, it wouldn't be this one. See <link linkend="docbook-namespace">Docbook namespace</link> for a discussion of
169 this issue.</para></footnote>
170
171 declaration may be added to the document root element. Anywhere in
172 the document, the prefix from this namespace declaration may be used
173 to include a Docbook element. This is useful if a document contains
174 many Docbook elements, such as <literal>footnote</literal> or
175 <literal>glossterm</literal>, interspersed with XHTML. (In this case it may
176 be more convenient to allow these elements in the XHTML namespace and
177 add a customization layer that translates them to Docbook elements,
178 however. See <link linkend="customization">Customization</link>.)</para>
179
180 <informalexample><programlisting>
181 &lt;html xmlns="http://www.w3.org/1999/xhtml"
182 xmlns:db="urn:docbook"&gt;
183 ...
184 &lt;p&gt;Some text&lt;db:footnote&gt;and a footnote&lt;/db:footnote&gt;.&lt;/p&gt;
185 </programlisting></informalexample></listitem></varlistentry><varlistentry><term>Local namespace</term><listitem><para>A Docbook element may be introduced along with a prefix-less
186 namespace declaration. This is useful for embedding a Docbook
187 document fragment (a hierarchy of elements that all use Docbook tags)
188 within an XHTML document.</para>
189
190 <informalexample><programlisting>
191 ...
192 &lt;articleinfo xmlns="urn:docbook"&gt;
193 &lt;author&gt;
194 &lt;firstname&gt;...&lt;/firstname&gt;
195 ...
196 </programlisting></informalexample></listitem></varlistentry></variablelist>
197
198 <para>The source to <ulink url="index.src.html">this document</ulink>
199 illustrates both of these techniques.</para>
200
201 <note><para>Both these techniques will cause your document to be
202 invalid as XHTML. In order to validate an XHTML document that
203 contains Docbook elements, you will need to create a custom schema.
204 Technically, you then ought to place your document in a different
205 namespace, but this will cause <literal>html2db.xsl</literal> not to recognize it!</para></note>
206
207
208 <para/></section><section><title>Output Processing Instructions</title>
209
210 <para><literal>html2db.xsl</literal> adds a few processing instructions to the output file.
211 The Docbook XSL stylesheets ignore these, but if you write a
212 customization layer for Docbook XSL, you can use the information in
213 these processing instructions to customize the HTML output. This can
214 be used, for example, to set the <literal>a</literal> <literal>onclick</literal>
215 and <literal>target</literal> attributes in the HTML files that Docbook XSL
216 creates to the same values they had in the input document.</para>
217
218 <variablelist><varlistentry><term><literal>&lt;?html2db attribute="<replaceable>name</replaceable>" value="<replaceable>value</replaceable>"?&gt;</literal></term><listitem><para>Placed inside a link element to capture the value of the <literal>a</literal> <literal>target</literal> and <literal>onclick</literal> attributes. <replaceable>name</replaceable> is the name of the attribute (<literal>target</literal> or <literal>onclick</literal>), and <replaceable>value</replaceable> is its value, with <literal>"</literal> and <literal>\</literal> replaced by <literal>\"</literal> and <literal>\\</literal>, respectively.</para></listitem></varlistentry><varlistentry><term><literal>&lt;?html2db element="br"?&gt;</literal></term><listitem><para>Represents the location of an XHTML <literal>br</literal> element in the
219 source document.</para></listitem></varlistentry></variablelist>
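
<para>As a rough illustration (the URL and link text are invented, and the exact position of the processing instruction inside the generated element is an assumption), a source link such as:</para>

<informalexample><programlisting>
&lt;a href="http://example.com/" target="_blank"&gt;example site&lt;/a&gt;
</programlisting></informalexample>

<para>might appear in the Docbook output as:</para>

<informalexample><programlisting>
&lt;ulink url="http://example.com/"&gt;&lt;?html2db attribute="target" value="_blank"?&gt;example site&lt;/ulink&gt;
</programlisting></informalexample>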
220
221 <para>You can also include <literal>&lt;?db2html?&gt;</literal> processing
222 instructions in the HTML source document, and they will be copied
223 through to the Docbook output file unchanged (as will all other
224 processing instructions).</para>
225
226
227 <para/></section></section><section id="customization"><title>Customization</title>
228 <para/><section><title>XSLT Parameters</title>
229 <variablelist><varlistentry><term><literal>&lt;xsl:param name="anchor-id-prefix" select="''"/&gt;</literal></term><listitem><para>Prefixed to every id generated from <literal>&lt;a name=&gt;</literal>
230 and <literal>&lt;a href="#"&gt;</literal>. This is useful to avoid
231 collisions between multiple documents that are compiled into the
232 same book. For instance, if a number of XHTML sources are assembled
233 into chapters of a book, you can style each source file with a prefix of
234 <literal><replaceable>docid</replaceable>.</literal> where <replaceable>docid</replaceable> is a unique id
235 for each source file.</para></listitem></varlistentry><varlistentry><term><literal>&lt;xsl:param name="document-root" select="'article'"/&gt;</literal></term><listitem><para>The default document root. This can be overridden by
236 <literal>&lt;?html2db class="<replaceable>name</replaceable>"?&gt;</literal> within the
237 document itself, and defaults to <literal>article</literal>.</para></listitem></varlistentry></variablelist>
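
<para>One way to set these parameters, sketched here under the assumption that <literal>html2db.xsl</literal> sits in the same directory, is a small wrapper stylesheet. The prefix value <literal>chapter1.</literal> is a hypothetical choice for this example. Because the wrapper imports <literal>html2db.xsl</literal>, its parameter declarations take precedence over the imported defaults:</para>

<informalexample><programlisting>
&lt;xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"&gt;
  &lt;!-- xsl:import must come before any other top-level element --&gt;
  &lt;xsl:import href="html2db.xsl"/&gt;
  &lt;!-- hypothetical per-document prefix, to keep ids unique across a book --&gt;
  &lt;xsl:param name="anchor-id-prefix" select="'chapter1.'"/&gt;
  &lt;!-- emit a chapter instead of the default article --&gt;
  &lt;xsl:param name="document-root" select="'chapter'"/&gt;
&lt;/xsl:stylesheet&gt;
</programlisting></informalexample>

<para>With this wrapper, <literal>&lt;a name="intro"&gt;</literal> in the source would be expected to produce an id of <literal>chapter1.intro</literal>.</para>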
238
239 <para/></section><section id="processing-instructions"><title>Processing instructions</title>
240 <para>Use the <literal>&lt;?html2db?&gt;</literal> processing instruction to
241 customize the transformation of the XHTML source to Docbook:</para>
242
243 <informaltable><tgroup cols="3"><thead><row><entry>Processing instruction</entry><entry>Content</entry><entry>Effect</entry></row>
244 </thead><tbody><row><entry><literal>&lt;?html2db class="<replaceable>xxx</replaceable>"?&gt;</literal></entry><entry><literal>body</literal></entry><entry>Sets the output document root to <replaceable>xxx</replaceable>. Useful for
245 translating to <literal>preface</literal>, <literal>appendix</literal>, or <literal>chapter</literal>; the default is
246 <replaceable>$document-root</replaceable>.</entry></row>
247 <row id="simplelist"><entry><literal>&lt;?html2db class="simplelist"?&gt;</literal></entry><entry><literal>ul</literal></entry><entry>Creates a vertical <literal>simplelist</literal>.<footnote><para>Note that the
248 current implementation simply checks for the presence of <emphasis role="em">any</emphasis>
249 <literal>html2db</literal> processing instruction.</para></footnote></entry></row>
250 <row><entry><literal>&lt;?html2db rowsep="1"?&gt;</literal></entry><entry><literal>[informal]table</literal></entry><entry>Sets the <literal>rowsep</literal> attribute on the generated <literal>table</literal>.<footnote><para>Note that the current implementation simply checks for the presence of <emphasis role="em">any</emphasis> <literal>html2db</literal> processing instruction that begins with <literal>rowsep</literal>, and assumes the value is <literal>1</literal>.</para></footnote></entry></row>
251 </tbody></tgroup></informaltable>
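
<para>A minimal sketch of a source fragment that uses all three instructions follows. It assumes that each processing instruction is written as a child of the element named in the Content column; the heading, list items, and cell text are invented for illustration:</para>

<informalexample><programlisting>
&lt;body&gt;
  &lt;?html2db class="appendix"?&gt;
  &lt;h1&gt;Reference&lt;/h1&gt;
  &lt;ul&gt;
    &lt;?html2db class="simplelist"?&gt;
    &lt;li&gt;alpha&lt;/li&gt;
    &lt;li&gt;beta&lt;/li&gt;
  &lt;/ul&gt;
  &lt;table&gt;
    &lt;?html2db rowsep="1"?&gt;
    &lt;tr&gt;&lt;td&gt;cell&lt;/td&gt;&lt;/tr&gt;
  &lt;/table&gt;
&lt;/body&gt;
</programlisting></informalexample>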
252
253 <para/></section><section id="embedding"><title>Overriding the built-in templates</title>
254 <para>For cases where the previous techniques don't allow for enough
255 customization, you can override the built-in templates. You will need
256 to know XSLT in order to do this, and you will need to write a new
257 stylesheet that uses the <literal>xsl:import</literal> element to import
258 <literal>html2db.xsl</literal>.</para>
259
260 <para>The <ulink url="examples.xsl"><literal>example.xsl</literal></ulink> stylesheet
261 is an example customization layer. It recognizes the <literal>&lt;div
262 class="abstract"&gt;</literal> and <literal>&lt;p class="note"&gt;</literal>
263 classes in the <ulink url="index.src.html">source</ulink> for this document,
264 and generates the corresponding Docbook elements.</para>
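
<para>The following is a minimal sketch of such a customization layer, not the contents of the shipped example stylesheet. It assumes that the source uses the <literal>http://www.w3.org/1999/xhtml</literal> namespace (bound here to the arbitrary prefix <literal>h</literal>) and that the built-in templates run in the default mode, so that <literal>xsl:apply-templates</literal> hands child content back to <literal>html2db.xsl</literal>:</para>

<informalexample><programlisting>
&lt;xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="h"&gt;

  &lt;!-- imported templates have lower precedence than the ones below --&gt;
  &lt;xsl:import href="html2db.xsl"/&gt;

  &lt;!-- &lt;p class="note"&gt; becomes a Docbook note --&gt;
  &lt;xsl:template match="h:p[@class='note']"&gt;
    &lt;note&gt;&lt;para&gt;&lt;xsl:apply-templates/&gt;&lt;/para&gt;&lt;/note&gt;
  &lt;/xsl:template&gt;

  &lt;!-- &lt;div class="abstract"&gt; becomes a Docbook abstract --&gt;
  &lt;xsl:template match="h:div[@class='abstract']"&gt;
    &lt;abstract&gt;&lt;xsl:apply-templates/&gt;&lt;/abstract&gt;
  &lt;/xsl:template&gt;

&lt;/xsl:stylesheet&gt;
</programlisting></informalexample>

<para>To use a layer like this, pass it to the XSLT processor in place of <literal>html2db.xsl</literal>.</para>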
265
266
267 <para/></section></section><section><title>FAQ</title>
268 <para/><section><title>Why generate Docbook?</title>
269 <para>The primary reason to use Docbook as an <emphasis role="em">output</emphasis> format is
270 to take advantage of the Docbook XSL stylesheets. These are a
271 well-designed, well-documented set of XSL stylesheets that provide a
272 variety of publishing features that would be difficult to recreate
273 from scratch for HTML:</para>
274
275 <itemizedlist spacing="compact"><listitem><para>Automatic Table-of-Contents generation</para></listitem><listitem><para>Automatic part, chapter, and section numbering.</para></listitem><listitem><para>Creation of single-page, multi-page, PDF, and WinHelp files from the same source document.</para></listitem><listitem><para>Navigation headers, footers, and metadata for multi-page HTML
276 documents.</para></listitem><listitem><para>Link resolution and link target text insertion across multiple pages and numbered targets.</para></listitem><listitem><para>Figure, example, and table numbering, and tables of these.</para></listitem><listitem><para>Index and glossary tools.</para></listitem></itemizedlist>
277
278 <para/></section><section><title>Why write in XHTML?</title>
279
280 <para>Given that Docbook is so great, why not write in it?</para>
281
282 <para>Where there are not legacy concerns, Docbook is probably a better
283 choice for structured or technical documentation.</para>
284
285 <para>Where the only legacy concern is the documents themselves, and not
286 the tools and skill sets of documentation contributors, you should
287 consider using an (X)HTML convertor to perform a one-time conversion
288 of your documentation source into Docbook, and then switching
289 development to the result files. You can use this stylesheet to
290 perform this conversion, or evaluate other tools, many of which are
291 probably appropriate for this purpose.</para>
292
293 <para>Often there are other legacy concerns: the availability of cheap
294 (including free) and usable HTML editors and editing modes; and the
295 fact that it's easier to teach people XHTML than Docbook. If either
296 of these is an issue in your organization, you may want to maintain
297 documentation sources in XHTML instead of Docbook.</para>
298
299 <para>For example, at <ulink url="http://www.laszlosystems.com/">Laszlo</ulink>,
300 most developers contribute directly to the documentation. Requiring
301 that developers learn Docbook, or that they wait on the doc team to
302 get content into the docs, would discourage this.</para>
303
304 <para/></section><section><title>Why not use an existing convertor?</title>
305
306 <para>This isn't the first (X)HTML to Docbook convertor. Why not use one
307 of the existing ones?</para>
308
309 <para>Each of the HTML to Docbook convertors that I could find had at least some
310 of the following limitations, some of which stemmed from their
311 intended use as one-time-only convertors for legacy documents:</para>
312
313 <itemizedlist spacing="compact"><listitem><para>Many only operated on a subset of HTML, and relied upon hand
314 editing of the output to clean up mistakes. This made them impossible
315 to use as part of a processing pipeline, where the source is
316 <emphasis role="em">maintained</emphasis> in XHTML.</para></listitem><listitem><para>There was no way to customize the output, except by (1) hand
317 editing, or (2) writing a post-processing stylesheet, which didn't
318 have access to the information in the XHTML source document.</para></listitem><listitem><para>Many of them were difficult or impossible to customize and
319 extend. They were closed-source, or written in Java or Perl (which I
320 find to be difficult languages to use for customizing this kind of
321 thing) and embedded in a larger system.</para></listitem><listitem><para>They didn't take full advantage of the Docbook tag set and content
322 model to represent document structure. For instance, they didn't
323 generate nested <literal>section</literal> elements to represent
324 <literal>h1</literal> <literal>h2</literal> sequences, or <literal>table</literal> to
325 represent tables with <literal>summary</literal> attributes.</para></listitem></itemizedlist>
326
327 <para/></section><section><title>I got this error. What does it mean?</title>
328 <variablelist><varlistentry><term>Q. <literal>Fatal Error! The element type "br" must be terminated by the matching end-tag "&lt;/br&gt;".
329 </literal></term><listitem><para>A. Your document is HTML, not <emphasis role="em">X</emphasis>HTML. You need to fix it, or run it through Tidy first.</para></listitem></varlistentry><varlistentry><term>Q. My output document is empty except for the <literal>&lt;?xml version="1.0" encoding="UTF-8"?&gt;</literal> line.</term><listitem><para>A. The document is missing a namespace declaration. See the <ulink url="index.src.html">example</ulink> for an example.</para></listitem></varlistentry><varlistentry><term>Q. Some of the headers and document sections are repeated multiple times.</term><listitem><para>A. The document has out-of-sequence headers, such as <literal>h1</literal> followed by <literal>h3</literal> (instead of <literal>h2</literal>). This won't work.</para></listitem></varlistentry><varlistentry><term>Q. <literal>Fatal Error! The prefix "db" for element "db:footnote" is not bound.</literal></term><listitem><para>A. You haven't declared the <literal>db</literal> namespace prefix. See the <ulink url="index.src.html">example</ulink> for an example.</para></listitem></varlistentry></variablelist>
330
331
332 <para/></section></section><section><title>Implementation Notes</title>
333
334 <para/><section><title>Bugs</title>
335 <itemizedlist spacing="compact"><listitem><para>Improperly sequenced <literal>h<replaceable>n</replaceable></literal> (for example
336 <literal>h1</literal> followed by <literal>h3</literal>, instead of
337 <literal>h2</literal>) will result in duplicate text.</para></listitem></itemizedlist>
338
339
340 <para/></section><section><title>Limitations</title>
341 <itemizedlist spacing="compact"><listitem><para>The <literal>id</literal> attribute is only preserved for certain
342 elements (at least <literal>h<replaceable>n</replaceable></literal>, images, paragraphs, and
343 tables). It ought to be preserved for all of them.</para></listitem><listitem><para>Only the <link linkend="tables">very simplest</link> table format is
344 implemented.</para></listitem><listitem><para>Always uses compact lists.</para></listitem><listitem><para>The string matching for <literal>&lt;?html2db
345 class="<replaceable>classname</replaceable>"?&gt;</literal> requires an exact match
346 (spaces and all).</para></listitem><listitem><para>The <link linkend="implicit-blocks">implicit blocks</link> code is easily
347 confused, as documented in that section. This is
348 easy to fix now that I understand the difference between block and
349 inline elements (I didn't when I was implementing this), but I
350 probably won't do so until I run into the problem again.</para></listitem></itemizedlist>
351
352
353
354
355 <para/></section><section><title>Wishlist</title>
356 <itemizedlist spacing="compact"><listitem><para>Allow <literal>&lt;?html2db attribute-name="<replaceable>name</replaceable>"
357 value="<replaceable>value</replaceable>"?&gt;</literal> at any position, to set arbitrary
358 Docbook attributes on the generated element.</para></listitem><listitem><para>Use a different technique from the <link linkend="docbook-elements">fake
359 namespace prefix</link> to name Docbook elements in the source, that
360 preserves the XHTML validity of the source file. For example, an
361 option to transform <literal>&lt;div class="db:footnote"&gt;</literal> into
362 <literal>&lt;footnote&gt;</literal>, or to use a processing instruction
363 (<literal>&lt;div&gt;&lt;?html2db classname="footnote"?&gt;</literal>).</para></listitem><listitem><para>Parse DC metadata from XHTML <literal>html/head/meta</literal>.</para></listitem><listitem><para>Add an option to use <literal>html/head/title</literal> instead of
364 <literal>html/body/h1[1]</literal> for top title.</para></listitem><listitem><para>Allow an <literal>id</literal> on every element.</para></listitem><listitem><para>Add an option to translate the XHTML <literal>class</literal> into a
365 Docbook <literal>role</literal>.</para></listitem><listitem><para>Preserve more of the whitespace from the source document, especially within lists and tables, in order to make it easier to debug the output document.</para></listitem></itemizedlist>
366
367
368 <para/></section><section><title>Design Notes</title>
369 <para/><section id="docbook-namespace"><title>The Docbook Namespace</title>
370 <para><literal>html2db.xsl</literal> accepts elements in the "Docbook namespace" in XHTML
371 source. This namespace is <literal>urn:docbook</literal>.</para>
372
373 <para>This isn't technically correct. Docbook doesn't really have a
374 namespace, and if it did, it wouldn't be this one. <ulink url="http://www.faqs.org/rfcs/rfc3151.html">RFC 3151</ulink> suggests
375 <literal>urn:publicid:-:OASIS:DTD+DocBook+XML+V4.1.2:EN</literal> as the
376 Docbook namespace.</para>
377
378 <para>There are two problems with the RFC 3151 namespace. First, it's long
379 and hard to remember. Second, it's limited to Docbook v4.1.2,
380 but <literal>html2db.xsl</literal> works with other versions of Docbook too, which would
381 presumably have other namespaces. I think it's more useful to
382 <emphasis role="em">under</emphasis>specify the Docbook version in the spec for this tool.
383 Docbook itself underspecifies the version completely, by avoiding a
384 namespace at all, but when mixing Docbook and XHTML elements I find it
385 useful to be <emphasis role="em">more</emphasis> specific than that.</para>
386
387 <para/></section></section><section><title>History</title>
388 <para>The original version of <literal>html2db.xsl</literal> was written by <ulink url="http://osteele.com">Oliver Steele</ulink>, as part of the <ulink url="http://laszlosystems.com">Laszlo Systems, Inc.</ulink> documentation
389 effort. We had a set of custom stylesheets that formatted and added
390 linking information to programming-language elements such as
391 <literal>classname</literal> and <literal>tagname</literal>, and added a
392 Table of Contents to chapter documentation and numbers to examples.</para>
393
394 <para>As the documentation set grew, the doc team (John Sundman)
395 requested features such as inter-chapter navigation, callouts, and
396 index and glossary elements. I was able to beat all of these back
397 except for navigation, which seemed critical. After a few days trying
398 to implement this, I decided it would be simpler to convert the subset
399 of XHTML that we used into a subset of Docbook, and use the latter to
400 add navigation. (Once this was done, the other features came for
401 free.)</para>
402
403 <para>During my August 2004 "sabbatical", I factored the general html2db
404 code out from the Laszlo-specific code, refactored and otherwise
405 cleaned it up, and wrote this documentation.</para>
406
407 <para/></section><section><title>Credits</title>
408 <para><literal>html2db.xsl</literal> was written by <ulink url="http://osteele.com">Oliver Steele</ulink>, as part of the <ulink url="http://laszlosystems.com">Laszlo Systems, Inc.</ulink> documentation effort.</para>
409
410 <para/></section></section></article>