Do We Really Need Structured Document Formats? (Is Real Reuse Possible?)

Do we really need structured document formats? In one meeting, every reason we came up with that made them seem necessary was answered by a convincing counter-argument. "Reuse" would seem to be the most important reason. And maybe there are some compelling cases. But maybe all-out reuse isn't needed. Maybe we really only need a very restricted form that solves those cases.

This post summarizes the arguments we considered. Do they demolish the case for structured documents in a highly fluid setting like the software industry? Do they demolish the case for structured documents and reuse? Are they wrong in some important respect? Or do they overlook some vitally important point that makes structured document formats irreplaceable?

You be the judge. And please let us know. We really want to know.


Structured document formats like DITA, DocBook, and Solbook are characterized by deeply nested tags and a multitude of schema constraints. Unstructured tagging languages like HTML, on the other hand, are wide open. Here are just a few of the more intriguing things you can do in HTML:

  • You can put an H1 head anywhere you want--even in a bullet list.
  • You can put an <hr> tag between list entries to create a separation.
  • You can start a list item with <li> and end it with </p> to achieve double-spacing.
That's all very interesting for writers, of course. But it's hell on processors. Browsers and HTML editors spent more than a decade handling all of those weird cases. That's why there are no really good diff tools or transformation languages for HTML--it's hard to write regular rules for something that is so completely irregular.

XML languages solve the tooling problem nicely by enforcing regularity in the tags. Every opening tag must be paired with a closing tag. Tags must nest properly. You can't just put a heading anywhere you want. Only specific locations are allowed for tags.
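As a minimal illustration (using Python's standard XML parser; the fragments are invented), the same sort of tag soup that a browser shrugs off is rejected outright by an XML parser:

```python
import xml.etree.ElementTree as ET

# A well-formed fragment: every tag closed, properly nested.
well_formed = "<doc><title>Hi</title><p>Body</p></doc>"
doc = ET.fromstring(well_formed)          # parses cleanly

# HTML-style tag soup: <title> and <p> are never closed.
tag_soup = "<doc><title>Hi<p>Body</doc>"
try:
    ET.fromstring(tag_soup)
    soup_parsed = True
except ET.ParseError:
    soup_parsed = False                   # the parser refuses it
```

That refusal is exactly what makes regular tooling (diffs, transforms) possible downstream.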

Their strength comes from a combination of nesting and schema constraints. But their greatest strength is also their greatest weakness: the very regularity that enables sophisticated processing also makes them very difficult to edit--and the really good editors are also really expensive.

These thoughts came to a head the other day when I was in a meeting. The question was asked: "Do we really need structured documents?" We could think of several problems they solved, but in the course of the conversation we surfaced compelling counter arguments for each case:

  • Conditional metadata
  • Variable substitution
  • Arbitrary transclusion
    • Transclusion in Structured Formats
    • The Difficulty of Editing a Structured Format
    • Transclusion in Wikis
  • Formatting & containment
  • Production capabilities

Those thoughts made me wonder if we could get along just fine with a simpler system--perhaps one that lives at the 80/20 point, where we can get most of the benefits of reuse without adding the complexity entailed by document structuring. (But I confess to having no particular vision of what such a system would look like, at the present time.)

So in addition to considering the arguments above, this post examines:

  • Is There an 80/20 Solution?
  • What about Version-Oriented Reuse?
  • So: Are Structured Formats Necessary? 

Conditional Metadata

Sometimes, you only need to reuse boilerplate material--an electrical warning, for example, that goes in every product document. But if you're going to reuse task or reference information, odds are that you'll need to conditionalize some of it: "In this setting, say this. In that setting, say something else." (The DITA format is particularly adept at that. Any metadata you define can be applied to any element, anywhere in the hierarchy, where it can trigger conditional inclusion.)
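To make that concrete, here is a small sketch of the idea--not the DITA Open Toolkit's actual filtering machinery, and the product values are invented: content profiled with a product attribute, and a filter that keeps only what applies to a given build:

```python
import xml.etree.ElementTree as ET

# A DITA-flavored fragment (invented content): the product attribute
# marks conditionally included paragraphs.
topic = ET.fromstring("""
<section>
  <p product="productA">Configure Product A as follows...</p>
  <p product="productB">Product B needs no configuration.</p>
  <p>This paragraph appears in every build.</p>
</section>""")

def filter_for(tree, product):
    """Drop any element profiled for a different product."""
    for parent in list(tree.iter()):
        for child in list(parent):
            prop = child.get("product")
            if prop is not None and prop != product:
                parent.remove(child)
    return tree

build_a = filter_for(topic, "productA")
kept = [p.text for p in build_a.iter("p")]
```

The Product B paragraph disappears from the Product A build; the unprofiled paragraph survives in every build.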

But the fact that such a thing can be done does not mean that it is necessarily desirable to do so. Experience suggests that reuse gets tricky when your environment changes--because your metadata reflects your environment. If your environment doesn't change, then your metadata is fixed. You define it, assign it, and that's the end of it. The metadata tagging adds some complexity to your information, but you can live with it, and it buys you a lot.

But when today's "Product A" splits into tomorrow's "Product A1" and "Product B (formerly A2)", as happens so often in the software industry, the metadata hierarchy has to change--along with every bit of information tagged using the previous hierarchy. (Those changes have implications for variable-substitutions as well. That's the subject of the next section.)

When the environment changes, the additional complexity built into the docs begins to hurt. You not only have to change content, you also have to redefine your metadata, re-tag your content, and keep track of where you are when you're halfway through the process.

In this case, you suddenly need three tags, one for each of the individual products and one for the bits used by both. You now have to examine every bit of information to see which of the three tags it should get. That sort of re-tagging scenario can quickly become a nightmare.

So perhaps the kind of metadata tagging that promotes reuse is not as useful in a highly fluid setting like the software industry--even though it may be perfect in an older industry where metadata is more stable.

Variable Substitution

Similarly, what about the nice little product-name variable that got created so that "Product X" could morph into "Product Y"? That seems like all you need--until marketing pulls a fast one and comes up with the "Product X family", which includes a couple of different products. Now the concept pieces you wrote for Product X need to describe the family, rather than a single product, and then introduce the members of the family.

So you find yourself needing to substitute for <Product X> in sentences like these:

 "The configuration of <Product X> is a blast..."
 "<Product X> configuration has been made easy..."
 "Working with a <Product X> configuration is..."

You see the problems. In the first two cases, the article "the" needs to be added, and it needs different capitalization when it occurs at the beginning of a sentence:

"the product X family"
"The product X family"

But the first and third cases have a deeper problem, in that you never configure a "family", but rather, one of the products in that family. So in the first case <Product X> would need to become "a product in the Product X family" (with different capitalization depending on where it occurs). In the third case, the word order needs to change from "<Product X> configuration" to "configuration of <a product in the Product X family>".
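A quick sketch shows how a blind substitution trips over exactly these cases (the sentences are the examples above; the replacement logic is deliberately naive):

```python
sentences = [
    "The configuration of <Product X> is a blast...",
    "<Product X> configuration has been made easy...",
]

# Blind global replacement, with no awareness of grammar or position:
naive = [s.replace("<Product X>", "the Product X family")
         for s in sentences]

# The second sentence now begins with a lowercase "the", and the
# first claims you configure a "family" -- which you never do.
```

Getting those cases right takes human judgment, which is the point of the argument above.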

So what do you do? One option is to go through all of the documents in your system, meticulously revising them to reflect the new metadata hierarchy in all of its wonderful variations. But here's the rub: Not all of the variables need to be changed:

  • Some places that had a <product_name> variable need to be changed to <product_family_name>

  • Others don't, and should continue to talk about Product X or Product Y, as appropriate

How do you tell? You can't automate it, that's for sure. It takes human eyeballs. So once again you're traipsing through all of your docs to make changes.

And as sure as the world is round, you're going to find yourself somewhere downstream with another real world change that messes up everything--like maybe the whole "family" concept disappears and marketing no longer wants to have any documents that talk about the family as a whole.

Arbitrary Transclusion

A step up from single-element transclusion is the ability to transclude whole subsections. Once again, the DITA format excels at that task. Anything you can define in a topic can be transcluded elsewhere. (Again, just because you can, doesn't mean you should. Best practice suggests that transcluded material should be gathered together in some common area, rather than pulling things willy-nilly out of existing documents.)

Transclusion in Structured Formats 

The ability to transclude material depends on two factors: The ability to tell where it starts, and the ability to tell where it ends. When a URI points to an element, it specifies the starting point. And the element's closing tag tells where that particular bit of material ends.

In addition to the regularity of structured formats, it is their nesting that makes arbitrary transclusion readily possible. In DITA for example, when a <section> starts, it may contain any number of sub-sections, but its end point is well-defined. In HTML, on the other hand, an H2 section could be terminated by an H1, another H2, or the end of file--but if the next H2 you see is in a list or a table, you have to ignore it--unless the original H2 was in that list or table, ...
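A deliberately naive sketch makes the flat-HTML case concrete (regex-based, and invented markup; real pages would defeat it for exactly the reasons above):

```python
import re

html = ("<h1>Doc</h1><h2>Setup</h2><p>Step one.</p>"
        "<h2>Usage</h2><p>More.</p>")

def h2_section(page, title):
    """Naive extraction: a section runs from its <h2> to the next
    <h1>/<h2> or end of file. A real tool would also have to skip
    headings nested inside lists or tables -- the part that makes
    the problem compound beyond practicality."""
    m = re.search(r"<h2>" + re.escape(title) + r"</h2>(.*?)(?=<h[12]>|$)",
                  page, re.S)
    return m.group(1) if m else None

body = h2_section(html, "Setup")   # "<p>Step one.</p>"
```

In a nested format, by contrast, the closing tag hands you the end point for free.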

You see the difficulty. The complexity compounds to where it's just impractical to make things work. So nesting greatly enhances the capacity for back-end processing--but it also makes editing difficult...

The Difficulty of Editing a Structured Format

So nested tags are highly desirable for transclusions. But it is also the case that the combination of nesting and schema constraints makes it very difficult for people to work with a structured format, even with expensive editors.

There are understandable reasons for that difficulty. Schema constraints prevent you from adding things in the wrong place, but the right place can be difficult to find:

  • Element nesting produces long sequences of closing tags.

  • Some elements can be added between one pair of closing tags; other elements can only go in the position adjacent to it.

  • Cursor placement must therefore be very precise--but all of those closing tags render as so much whitespace, so a WYSIWYG editor doesn't help you at all. It hurts, in fact.

  • To control where the cursor goes, you either have to work with the tags directly (shudder), or you need a view that shows you where the tags are.

  • By definition, at that point, you have left the realm of "user friendly behavior". You are forcing the user to understand the internal structure of the markup language.

  • And even when you use an expensive editor that tries to be helpful, the list of items you can insert at each position may not contain the thing you want, because it's under something else that you have to insert first. (Simple example: You may know you want to create an HTML list item, <li>. But before you can do that, you have to choose the type of list and insert the starting tag for it--<ol> or <ul>.)

If the thing you want to add isn't visible, and you don't recognize the containing element, you're hopelessly out of luck. You wind up moving the cursor from position to position, scanning lists of tags you don't recognize.

Note:
One editor tries to ameliorate that problem by jumping down to the next legal location where an insertion can be made. But that means you are choosing from all possible tags, which can be difficult. It can produce significant surprises, as well, when the editor jumps down half a page to make the insertion. And in any case, you are forcing the user to understand the markup language.

Those observations explain why structured document formats are so difficult to use: They force you to memorize the tagging structure. They require training, as a result, because it's virtually impossible for the average user to be productive without it.

The editing situation is much better with DITA (120 topic tags, plus 80 for maps) than it is with DocBook (800 tags), or even Solbook (400 tags), but it is still way more difficult than simple HTML--80 tags, many of which can nest, but few of which have to.

But even with a relatively simple format like HTML, we have manual-editing horror stories. In one instance, a title heading was created with 18 non-breaking spaces and a 21-point font. (Try putting that through your automated processor.)

That sort of problem suggests that all editing should be done in a WYSIWYG editor--except for one problem: I have yet to see a nested-markup editor that does not require you to go into the markup to correct the tags, at some point. Even editor programs get confused by the nesting!

Note:
Desktop editors like Open Office, on the other hand, work beautifully without ever requiring the user to see or understand the underlying markup language. One reason for that reliability is the extensive development that has gone into them. But another very significant reason may well be that they are presentation formats, like HTML, so there is less need for nesting.

Transclusion in Wikis 

Wikis, of course, are focused on simple markup for the simplest possible subset of HTML. That strategy leaves out some important functionality, but there is no denying that the result is user friendly.

Wikis provide a minimal level of transclusion, as well. In MediaWiki, for example, you mark a section to declare it as transcludable. You can then transclude that material by putting the reference in double-braces: {{some_section}}
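In stock MediaWiki, for instance, the marking can be done with <onlyinclude> tags (the page names here are invented):

```
On the page "Electrical_warning":

    <onlyinclude>Caution: disconnect power before servicing.</onlyinclude>
    Everything outside the tags stays on this page only.

On any other page:

    {{:Electrical_warning}}
```

The double-brace reference pulls in just the marked span; the leading colon tells MediaWiki to look for a regular page rather than a template.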

Of course, you don't get the same kind of multi-level referencing that you get with a fully nested structure. But then, do you really need it? (If so, in which cases is it an absolute requirement? More on that subject in a moment.)

Note that I like nesting. I really do. I fell in love with it back when I was working on outliners, and found that I could move a chapter as easily as a word. A section title becomes a "handle", in effect. You just grab the handle, drag it somewhere else, and everything in that section comes along for the ride. Super convenient.

But none of the major XML editors make any real use of that potential--for a fairly good reason. In an outline, you have a hierarchy of homogeneous types. So anything can move anywhere. That makes for a really powerful interface. But in a structured format with heterogeneous tags, schema restrictions are such that you can't just move things around at will. That explains why structure editors don't implement such a feature--there's little you could do with it, if you had it.

So given that nested formats don't give us a lot of manipulative power, and they do give us a lot of headaches, how necessary are they, really? Are there any reasons other than reuse that make them necessary?

Formatting and Containment

So far, the focus has been nesting and its relationship to reuse. But that isn't the only thing that nesting is good for. There is also the small matter of formatting and containment.

For example, take an ordered list with an additional paragraph or two of explanation under each item. Common Wiki markup languages do not handle that situation at all. You wind up having to manually number the list, or every entry in it has to be a link to the actual information.

One possible workaround for that problem is to use Wiki transclusions (assuming they solve this particular problem). So, in MediaWiki syntax, instead of manually numbering steps like this:

  1. Do X

      Explanation of X

  2. Do Y

You would transclude elements of the list:

  # {{how_to_do_X}}
  # {{how_to_do_Y}}

I haven't tried an experiment to see if that works. If it does, that solves the problem for the reader. But it sure makes life more difficult for the writer. That problem, in turn, can only be solved with a WYSIWYG editor that automatically does the refactoring behind the scenes, so what appears to you as a single page is in fact composed of multiple separate components.

Of course, that's a lot of trouble to go to for a single case. But that may be one of the only cases where nesting is really necessary. (The original version of HTML certainly seemed to reflect that opinion, since lists and tables are pretty much the only basic HTML tags that do nest.)

And therein lies the "easy win" for this otherwise problematic use case: Since the really useful Wiki languages allow (x)HTML tags to be embedded in the page, nested lists can be constructed when needed, without having to nest every tag, everywhere.
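For example, an ordered list with explanatory paragraphs under each step can be dropped into the page as plain (x)HTML (the content is invented):

```html
<!-- Embedded (x)HTML in a wiki page: automatic numbering survives
     the explanatory paragraphs because they nest inside the <li> -->
<ol>
  <li><p>Do X</p>
      <p>Explanation of X.</p></li>
  <li><p>Do Y</p></li>
</ol>
```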

Note:
It's interesting to look back at the original version of HTML and to recognize that, minimal as it was, it defined nesting for lists and for tables. My two observations: 1) Yeah, they really were that brilliant, and 2) Those would seem to be the two cases where nesting is really unavoidable.

Production Systems

Perhaps the major advantage of structured document formats is the production tools that let you produce multiple deliverables. PDF and HTML top the list, but there are also troff and online help.

But given that there are multiple ways of converting unstructured docs like Wiki text or HTML into structured formats, it would seem reasonable to feed them into one or more production systems.

Here are the production paths that can produce docs from plain Wiki text and ODF, without having to do anything special to either one:

  • Wiki --> HTML --(PP + OT)--> DITA
  • ODF ---> HTML --(PP + OT)--> DITA
  • ODF --($Bridge)--> DITA
  • ODF --($Bridge)--> DocBook --> <docs>
  • DITA --(OT)--> <docs>
  • DITA --(OT)--> DocBook --> <docs>
  • DITA --($Bridge)--> DocBook --> <docs>

Notes:

  • OT is the DITA Open Toolkit.

  • PP is a pre-processor that needs to be written. The OT consists of XSL, which does not lend itself to recursive processing. But to convert anything with more than one level of subheads, recursive processing is needed to create nested topics. (That may be the best argument yet for Martin Fowler's XSLT in Ruby.)
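The recursive step the pre-processor needs can be sketched in a few lines (in Python here purely for illustration; the actual PP does not exist yet):

```python
# Turn a flat run of (level, title) subheads into a nested topic
# tree -- the recursion that XSLT's style makes awkward.

def nest(headings, level=1):
    """Group a flat (level, title) list into a nested topic tree."""
    topics, i = [], 0
    while i < len(headings):
        lvl, title = headings[i]
        # Collect everything deeper than this heading as its children.
        j = i + 1
        while j < len(headings) and headings[j][0] > lvl:
            j += 1
        topics.append({"title": title,
                       "children": nest(headings[i + 1:j], lvl + 1)})
        i = j
    return topics

flat = [(1, "Install"), (2, "Linux"), (2, "Windows"), (1, "Configure")]
tree = nest(flat)
```

Each top-level topic absorbs the deeper subheads that follow it, which is exactly the nested-topic shape DITA wants.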

  • MediaCloth is ideal for the Wiki conversion, because it's driven by grammar rules, and because the HTML conversion is completely separated from the web-service and interaction management.

  • $Bridge is the DITA/DocBook/ODF bridge sold by Flatirons Solutions. Since it's a commercial product, it will be solid and supported. But it costs.

Of course, the conversions have to work properly, and the more steps in the chain, the more places there are for things to go wrong. But given that the whole world seems to be working on these pathways, it's a pretty safe bet that, sooner or later, one or more of them is going to work well for your purposes, whatever they are.

Then there are the Solbook variations, and attendant questions:

  1. For each transform that produces DocBook, how well do the generated codes match the Solbook subset?

  2. The declaration needs to be modified, at a minimum, to point to the Solbook DTD. That, along with any changes to generated codes, can be done either in a post-processing step or by modifying the transform. Which is better?

  3. Is it possible to write a generic DocBook --> Solbook transform that covers all of the tags generated by the transforms? A post-processing transform of that kind could be easily inserted into any of the production pathways. (One major problem: How to translate the generic section heads that exist in DocBook, but not Solbook.)

Is There an 80/20 Solution?

So maybe a really minimal transclusion-capability is all we really need for reuse. Maybe we need to transclude boilerplate sections, and that's about all.

Instead of variable substitution, we'd have global, list-driven, pattern-based substitution that made it easy to run through the docs, selecting possible substitutions from a list--like the one I once wrote at Oracle, and have coveted ever since. (It's something you also need when changes to the metadata hierarchy makes you change document variables, so it's a desirable tool in any case.)
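The core of such a list-driven substitution tool is tiny--the value is in the review loop. A sketch (names and sentences invented; a real tool would prompt the writer interactively rather than take a prepared list of choices):

```python
import re

def reviewed_sub(text, pattern, choices):
    """Replace each match of pattern with the writer's pick for that
    occurrence; choices[i] is the decision for the i-th match."""
    picks = iter(choices)
    return re.sub(pattern, lambda m: next(picks), text)

doc = ("Configuring <Product X> is easy. "
       "<Product X> ships with sensible defaults.")
out = reviewed_sub(doc, r"<Product X>",
                   ["a product in the Product X family",
                    "The Product X family"])
```

Each occurrence gets its own, human-chosen replacement--which is the whole point, since no global rule covers every case.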

Instead of conditional metadata, we would have refactoring tools that let us split a topic into sections and have all references to that topic automatically replaced by references to the collection. (Joins are another problem, too deep to go into here.) So if you wanted to replace one sentence with another in a different context, you would split the material into three sections--the part above the sentence, the sentence itself, and the part below it. You would then produce two versions of the document; both would map in the common material, but would provide different versions of the sentence.
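A toy version of that split-and-reassemble scheme (fragment names and content invented):

```python
# Shared fragments plus one per-version sentence, assembled by a
# trivial "transclusion" map for each document version.
fragments = {
    "above":  "Install the package.",
    "below":  "Restart the service.",
    "sent_a": "Edit the config file to set the port.",
    "sent_b": "Set the port on the Settings page.",
}

def assemble(order):
    return " ".join(fragments[name] for name in order)

version_a = assemble(["above", "sent_a", "below"])
version_b = assemble(["above", "sent_b", "below"])
```

The common material lives in one place; only the divergent sentence is duplicated across versions.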

To be practical, the editor would have to deal with the transclusions invisibly (otherwise, you'd never see what you were editing; you'd be working with fragments that you have to "put together in your head"). But given a sufficiently rich editor, such a system could work.

So maybe really simple transclusion of sections is all that we really need. Instead of forcing everything to be nested, maybe the kind of transclusions that MediaWiki implements is sufficient: {{insert X here}}, where X is restricted to be a section that has been specially marked so it can be referenced. (Ok. Maybe we need a little more than that. But it's a place to start.)

What about Version-Oriented Reuse?

In Wikis, Docs, and the Reuse Proposition, I alluded to a theme that needs to be expressed more succinctly. Here is that theme:

In the world of technical documentation, there is a tension between two
very different, but equally important needs: The need to collaborate (best
served by Wikis) and the need for reuse (best served by structured
document formats).

If you are working on a purely altruistic open source project, with no profit motive whatever, and if you are doing so as part of a small, independent effort that does not share resources with other departments or other projects, then reuse is going to be an insignificant factor in your equation. You can and should optimize for collaboration. Install a Wiki, and let that be your documentation set.

But if you want to make a living at what you're doing, you'll need to find some way to fund the project. And it is sustaining engineering that makes that possible in an open source world. (With sustaining engineering, the world at large collaborates on the open source version of the project, adding many new features and introducing many new bugs in the process. Of course, old bugs are fixed as well. But organizations who depend on the application for mission-critical operations want the fixes for old bugs, and possibly a small set of added features, without risking the farm on a whole new crop of bugs--and they'll pay to get bug fixes in a way that safeguards their income stream.)

But if you have sustaining engineering, by definition you have at least two versions of the documentation. Of course, you want them to be single-sourced, if at all possible, especially when two versions become three, four, or more.

As mentioned in the aforementioned post, there are 5 dimensions of reuse for technical docs: across products, versions, platforms, audiences, and document types (tutorials, courses, reference docs, etc). The sustaining-engineering use case represents reuse across versions--a highly critical dimension for a project that wants to make enough money to sustain itself!

Fortunately, reuse across that particular dimension doesn't run into the same problems that you get with more general forms of variable substitution or conditional metadata.

Variable Substitution 

At a minimum, you want the document title to contain the product version, so that when a user searches, they'll see something other than a long list of documents for different versions, all with the exact same title! For that, you need variable substitution. But substituting one version number for another is a piece of cake! There are no problems of grammar or usage to worry about. No problems of number or gender. A number is a number. So variable substitutions work fine, on this dimension.

Conditional Metadata 

And it is almost always the case that some documents will contain differences in the new version. If the changes are small, conditional metadata makes it possible to single-source the document versions. And even if they're large, and require a total rewrite, it will still be the case that the other documents in the doc set will need to reference the new version, rather than the old one. There are a number of ways to solve that problem, and conditional text is one of the options. (Some form of link-variable or variable-based redirection is another possibility.)

Fortunately, as is the case with variable substitution, version metadata doesn't tend to run into the same problems that more generalized forms of metadata encounter. A version number is a version number. Whether products are organized into this or that family has little to do with their version numbers. (Even if the numbering changes, that's a pretty straightforward change to the metadata, and one that allows global search/replace on the content tags.)

So: Are Structured Formats Necessary?

In at least the case of version-dimension reuse, variable substitution and conditional metadata seem to be a darn good idea. And in at least the case of table and list tags, nesting seems to be a requirement. So it's clearly not the case that we can completely do without such capabilities.

On the other hand, the counter arguments against other forms of variable substitution and conditional metadata remain intact--at times, it is just too costly to keep them working, especially in an environment that changes frequently. And nesting everything may well be overkill, when so few forms of nesting are actually indispensable.

In addition, I know from personal experience that it is possible to be "seduced by the capacity for reuse", to the point that you over-engineer your docs like crazy, and take forever to deliver something "perfect" that would have been much better received had it been much more imperfect, and much more rapidly produced!

So there would seem to be an optimal level of reuse, somewhere, that has yet to be identified. (It is almost certainly the case that different situations have a different optimum--so it may be that a taxonomy of cases is waiting to be discovered.)

But assuming we could determine the optimal point in any individual case, the question becomes: What is the best strategy for implementing that level of reuse? Is it better to start with a highly general system like the DITA markup language--a system that is fully optimized for reuse--and to voluntarily restrict yourself from going beyond the point of optimality? Or is it too difficult to determine that point? Or is it possible that the siren call of ever greater generality will always prove too seductive for mortals to bear?

Or is it better to start with a system that provides no kinds of reuse--an ultra-simple system, like a Wiki, that gets reusable elements grafted onto it when and if the need for them becomes an absolute requirement? That is one of the hallmarks of the agile development methodology, after all--you only implement the functions you determine to be absolutely necessary, right at this moment. (I should appreciate that dictum, god knows. I've spent years adding functionality that might be needed some day and optimizing the hell out of systems that no longer even have any hardware to run on!)

So there you have it. I've argued both sides of every case, at this point. Hopefully I've illustrated the tension between structured document formats and relatively unstructured HTML/Wiki systems. One enables reuse and allows sophisticated processing. The other is easier to edit--and can be edited online--which fosters collaboration because of the exceedingly low barriers to entry.

Right now, there is a major gap between those two worlds. What sort of system is optimal to bridge that gap?

Next: Structured Document Formats, Part II.

Resources 

Comments:

I'll kick things off - too cold to ski!

Great topic, Eric, and absolutely critical to the well-being of DITA projects - and to technical writers' mental health - everywhere.

My dad used to say that if you are driving a vehicle and fooling around with your girl friend at the same time, you're doing two things badly. Focus is the key.

XML in general and DITA in particular solve the problems you mention at the top of your post. Solving the 're-use' issue, among others listed, requires admitting that the architecture is trying to reconcile two dialects at one time.

The semantics of DITA is not necessarily the semantics of the products (business process, software, electrical appliances, recipes, etc.). I can author valid DITA topics for any two of those, put them in a document and have my output in two or three output formats.

For example, I can create an <ol> component for steps for filing an expense report, and publish them directly above or below an <ol> for steps for grounding a satellite dish. DITA won't blink, but most rational folks recognize that unless I'm erecting a satellite dish to file my expense reports, those two things don't belong on the same page. The cognitive overload is much worse the closer the two concepts get.

The problems you describe (product splits, aggregations, and morphs) create new semantics in the product domain, not in DITA: the parts of DITA that handle the content describing the product configurations need not change.

The question of what is being re-used and by which domain is the real issue. DITA's schemas and rules enforce reuse of elements, assembly, and processing in the content domain. I believe that if a similar approach was used for products AND a reliable method existed to associate the product components, features, events, states, and activities to the DITA components, then most if not all of the challenges you list would disappear.

Best regards.

John O'

Posted by John O'Gorman on February 09, 2008 at 05:27 AM PST #

Thanks for the comments, John. I am particularly struck by this passage:
>
> The question of what is being re-used and by which
> domain is the real issue. DITA's schemas and rules
> enforce reuse of elements, assembly, and processing
> in the content domain. I believe that if a similar
> approach was used for products AND a reliable method
> existed to associate the product components, features,
> events, states, and activities to the DITA components,
> then most if not all of the challenges you list would
> disappear.
>
I would love to hear you elaborate on that point--perhaps over a cuppa. On the surface, it seems as though DITA specializations would create product associations. It's then a matter of constraints on what you can incorporate, isn't it? The idea is to keep from incorporating elements from other domains, and I believe that DITA can pretty well do that, yes? For example, if there is a "parts" topic in which "auto part list" extends "ordered list", it would not be possible to put a business-process item into that list.

That's my immediate reaction to what I'm hearing, but it is entirely possible that I'm not hearing what you mean to say. (I know I believe I understand what I think you said, but I'm not sure you realize that what I think I heard is not what you meant.)
:_)

Posted by Eric Armstrong on February 09, 2008 at 09:45 AM PST #

Ty for the thoughtful work. Who is the "we"? What is the context?

My two cents:

You do not need structured formats if you are writing information that:
--cannot be reused (or that you do not want to reuse)
--does not have to be correct
--is extremely small in quantity and will not grow much

Otherwise, if you are a professional technical writer, I can't imagine why you wouldn't want to use a structured format for content.

Explanation
I make intellectual property (IP) for the company. I create, manage, and distribute the information (not documents!) that our field team, partners, and customers need to implement and use our products. The IP I create goes into product Help, PDFs, tools help (Eclipse Help), and a support site. If I change one important fact (because it was incorrect), I want the cheapest method with the smallest chance of error of correcting and distributing that fact. A structured format allows me to define that fact once and build all the targets I need to with the smallest margin of error. I can do it all day long with the same results. I cannot guarantee my copy-and-paste technique is that reliable even on the first iteration.

If the incorrect fact is not corrected, my company could lose money or reputation, or I could lose my job.

The examples you give of HTML's benefits are, I believe, trivial and would address only a tiny minority of the presentation requirements I have. Product information needs to be useful, and a consistent (usually structured) organization and presentation increases its usefulness.

I'm only starting to use DITA for one small sub-set of product information and I will not implement many features of DITA. But the core features, reuse and possible automated builds, far outweigh the complexities and cost.

Context
I use unstructured FrameMaker for my doc set and will use DITA to document our product development tools. I will use BlueStream's XDocs and XMetal as my tool kit because they work well enough together that I can achieve my humble goals without worrying about the OT and its richness.

Posted by Paul Reeves on February 10, 2008 at 12:15 AM PST #

Top article.

But I think you're focusing on the shortcomings of reuse/transclusion and calling them shortcomings of structured formats. IMO transclusion is something that needs to be handled *very* carefully. As you point out with your -Product A- example, even simple variables end up unwieldy when a product range broadens (as, hopefully, in a successful company, it will). I used to work in a company where there was significant scope for custom options. But often the custom options were rolled into a separate product of their own. And some products were available in different physical forms (this was a real-stuff company). Using conditional content for all this quickly became a pain--but this isn't the fault of the structured document format.

My mistake lay in assuming that because the text for -Product A- and -Product A2- was similar, they were good candidates for reuse. As soon as marketing split up A and A2, the two topics became different on a fundamental level. Even though the two new products start off the same, there is no reason to think that will continue to be the case, so reuse is just going to be problematic. The only advantage of reuse in that situation is that if I find a typo in the original file, or I think of a better way to explain a common thing, I only have to edit it once--but that's not an indispensable feature by a long chalk.

The structured document format doesn't solve these problems, and it's misleading to assume that it will. What it *does* do is allow me to ensure consistency in the presentation of related information types, and to provide a much quicker route from automatically-generated information to final presentation. Nothing to do with reuse.

Real reuse comes at the topic level, above the document format. That's where the DITA "ethos" pays dividends: handling discrete concepts, it's easy to place these into documents where and only where that particular concept applies. Real reuse doesn't need structured document formats. Wikis are great for that kind of reuse. Let the developers bash away at the wiki, help them organize (and proof-read) the wikitopics, and pull'em into shape when a document requirement comes along.

I have never used a DITA conref - and probably never shall. It just doesn't solve the right problem.

Posted by David Linton on February 10, 2008 at 05:57 PM PST #

"Nothing to do with reuse." -- there should be a different word for the kind of reuse Paul Reeves identifies, of presenting the same information in many different forms. That is an advantage of the format, because of its ease of processing.

Posted by David Linton on February 10, 2008 at 06:03 PM PST #

Eric,
First, thanks a ton for that interesting post.
I've just come over to Sun from 5 years at an XML house where we used a customized DocBook 4.2 for all our documentation needs. Doc output originally was only HTML but soon after I joined, PDF was added.
In retrospect, there are things I liked and things I hated about DocBook. You are quite right that the user has to understand the markup language. Happily, Release Engineering handled all XSLT and stylesheets, so we didn't have to learn that, too. Unhappily it also meant we had no direct control over output look and feel, and if the RelEngs were busy it could be weeks or months before a stylesheet or transformation issue was addressed.
Oh, there's no reusability within a <book> in DocBook, either. All id values must be unique, meaning each section appears once and only once.
Now I am writing tutorials directly in HTML, plus some internal Wiki stuff. At some point I'll be working in JavaHelp as well. It's certainly much more forgiving, and we have actual WYSIWYG tools. I had been working entirely in XML source using the oXygen XML editor. On the other hand, once I understood the basics of DocBook syntax, I knew exactly what the heck was going on in my doc. In HTML I often have no idea. After 5 years of XML, unclosed tags and source-side formatting hacks cause me physical pain. HTML just strikes me as a chaotic mess where it's almost impossible to get any kind of consistency, even between writers sitting next to each other in the same office, since there are so many ways to do the same thing and all an HTML writer really cares about is what the output looks like.
On the other hand, I'd never suggest DocBook for online tutorials, not in a million years. Gross overkill. The smallest doc I was working on in DocBook was around 150 pages and one set was over 700. The enforced book format was quite useful for that. For an online tutorial of at most 20 pages (if it were PDF) and 6 or 7 sections, DocBook would be far too much overhead IMO.
I wish I knew more about DITA. We'd looked into it a bit but never considered implementation. Our doc system engineer was of the opinion that you could get a lot of the same functionality through proper use of profiling tags in DocBook.
One last thing: I'm surprised you haven't had more to say about XHTML. I'm quite interested in it for what I'm working on now. It would seem to require the least work for conversion from HTML while enforcing some kind of structure. I don't know much about Wikis beyond my TWiki experience, which is not something I'd want to inflict on a casual external user.

Posted by Jeff Rubinoff on February 10, 2008 at 08:50 PM PST #

Steve Whitlatch wrote the following in an email message:

> Do We Really Need Structured Document Formats?
>
Maybe not. Structure provides many benefits; attempts at reuse might.

Real reuse is possible, but it might require more human involvement than management hopes for. Worst case: many humans search, read, and decide. Better case, fewer humans build a sufficiently intelligent system that does what the many humans would have done.

I expect that planning for extensive content reuse without planning for the necessary additional human intelligence (and required intervention) would produce bad documentation, with wrong content, missing content, etc. That would probably be the result of management trying to get profit out of content-reuse attempts: squeezing too hard, expecting too much, going too fast, under-funding the project, and so on.

> > [. . . ] where we can get most of the benefits of
> > reuse without adding the complexity entailed
> > by document structuring.
> >
You want structure. My arguments are below. However, profitable content reuse may prove elusive. Determined attempts at extensive content reuse may even prove to be of negative value. If early adopters get burned, or if they have great success, perhaps they will share their experiences. I am guessing that the value of content-reuse attempts will prove to be very sensitive to the degree to which components are actually shared. If so, one should ask whether DITA is needed to get the benefits of content reuse. It may be that when products share components, shared content is easy, requiring no special schema/DTD specifically designed for reuse. I don't know.

Your arguments focus on content reuse and DITA, but the title of your article does ask whether " . . . We Really Need Structured Document Formats." Here are some arguments for structure in general, separate from the issue of content reuse.

1) Structure facilitates content storage and management, as with a relational database or a native XML database, including SQL/XQuery, backups, security, etc.

2) Structure makes it easier to publish the same content to multiple formats, but you stated that in your article.

3) Structure _can_ remove from authors all formatting duties.

4) Structure can make possible the use of inexpensive client-server publishing systems.

Posted by Eric Armstrong on February 11, 2008 at 08:53 AM PST #

Thanks for the discussion. It's good to question.

As a long-time structured author, I found myself thinking about a couple of things as I read through the article:

1) In one company where I was helping to bring in structure, the template-creator started with the misguided priority of making the structured template "do everything" that the authors had been able to do before. But the very reason to change to structure was to change and streamline the authors' writing, based on extensive analysis of what was really needed--not to perpetuate the old way through a new tool.

The discussion of the flexibility of HTML struck me as the same thing. It's good, not bad, that your XML solution isn't as flexible as HTML. Accommodating the same things writers have always done isn't a system requirement.

2) I'm not convinced that learning a mark-up language, or at least a schema, is a bad thing for writers. We are professionals and capable of understanding our tools. It certainly would not hurt our esteem in a company, either.

I think it comes back to the same presumption that structured authoring is supposed to do exactly the same things as unstructured authoring. But this is an entirely different methodology. It requires changes.

Posted by Amanda Cross on February 11, 2008 at 11:16 PM PST #

Hi, Amanda. Thanks for the comments.

When it comes to writers, I agree with you. There's no reason not to learn a thoroughly useful markup language, especially one that is as minimal as DITA. But in an open source world characterized by literally millions of potential contributors, restricting document production to writers creates one very large bottleneck!

Interestingly, writers often face a bottleneck themselves, when they need information from developers. Wiki pages have shown their value in that respect, in at least one project.

With a Wiki, developers can add material themselves. With the low barrier to entry, more of them do. That makes areas that are insufficiently documented stand out, and we've seen the responsible engineers coming to writers, wanting someone to help populate those pages. That's a pretty neat inversion of the typical scenario of writers chasing developers--one that was entirely unexpected when Wiki-based documentation was chosen.

On the other hand, I have to agree that HTML's very flexibility is a problem. As I think about trying to automatically refactor an arbitrary HTML page into DITA topics, I'm struck by a long list of potential problems.

For example, the lack of hierarchical nesting can make it difficult to identify section heads:

* There may not be an H1 head, or even an H2. There could even be no title at all--just 18 non-breaking spaces and a 21-point font.

* There could be an H3 subtitle before H2 subheads.

* There could be a list of TOC links that should be ignored.

* There could be an H3 or H4 before the TOC that says "Contents".

* There could be an H4 or H5 byline before the actual content starts.

That kind of irregularity can make it pretty difficult to figure out where to split a page into topics. The program is going to wind up having to make assumptions, which means there will be many failures. Either that, or it's going to have to gain the equivalent of a human being's optical structure recognition--no small feat of semi-AI hacking.
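
To make the problem concrete, here is a toy sketch of the kind of heuristic such a program needs (Python, standard library only). The rules shown--treat any H1 through H4 as a topic boundary, skip a bare "Contents" head--are illustrative assumptions, not a real converter:

```python
# Toy heuristic: collect candidate topic titles from irregular HTML.
# Real pages defeat simple rules like these, which is the point.
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4"}

class TopicSplitter(HTMLParser):
    """Accumulates the text of every heading element, in order."""
    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.in_heading = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading:
            self.titles[-1] += data.strip()

def topic_titles(html):
    parser = TopicSplitter()
    parser.feed(html)
    # Illustrative assumption: a bare "Contents" head marks a TOC,
    # not a topic--so drop it, along with empty headings.
    return [t for t in parser.titles if t and t.lower() != "contents"]
```

Every special case in the list above would need another rule like the "Contents" filter, and the rules soon start contradicting each other.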

Posted by Eric Armstrong on February 12, 2008 at 02:12 AM PST #

I enjoyed rich content-recycling during the years I managed technical documentation for more than a dozen non-native-English-speaking software engineers.

LaTeX and pdfLaTeX made amazing results possible -- and every time I had a need for one solution, I'd also find two or three additional techniques that I could implement immediately for additional doc-product enhancement.

My hands-on education in the development of .tex and .sty files (conceptually similar to .css) brought me skills that transferred readily to the world of XML. Learning to re-use content had been the easy part. The capabilities brought by pdfLaTeX practically begged me to re-use content *again*, just to see how another option might perform.

Real Life is *so* much bigger than single-use content allows!

Posted by Elisabeth Baker on February 12, 2008 at 07:17 AM PST #

A very interesting article.

You are right, reuse doesn't work in all instances. That's for certain. But there are many instances, notably where people are documenting a product line rather than single products, where it has been very, very effective. You need to analyze your options carefully--a point you very clearly made.

On the other hand, structure works in all instances. Even HTML is structured, although extremely loosely so.

The question is really how tight, restrictive, controlled, etc. do you need your structure to be? That question requires you to answer two more:

- How do you ensure that your authors create the content that you need in the creation process?

- How do you make sure you can manipulate what they create to get the outputs that you need?

If you have a lot of writers contributing, or if your writers are not all professional writers, or if your writers contribute on an infrequent basis, a more controlled structure can help guide the writing process more effectively and increase the likelihood that you will get the content that you need.

If you have a small group of writers, or they are all professional technical writers, or creating content is part of their day-to-day work, a looser structure might be better. Professionalism and peer pressure/edits might be sufficient to keep structure consistent.

Looking specifically at content: if your content needs are very predictable, you might prefer a more rigid structure. If your content needs change frequently, you might need to maintain a looser content structure.

Structure promotes consistency. (Assuming you get the structures right.) Consistency promotes usability. And that’s the ultimate goal.

Posted by Steve Manning on February 12, 2008 at 10:50 PM PST #

Elisabeth Baker wrote:
> (an intriguing suggestion of reuse capabilities in
> Knuth's non-XML publication environments)
>
Color me seriously interested! Can you point me to an overview of the mechanisms you used to achieve content-recycling? Can you write something up and post it somewhere? I'd love to see a diary of the problems you encountered, the solution possibilities you uncovered, and the solution you wound up implementing!

Steve Manning wrote:
>
> (Some marvelous heuristics for determining how much
> structure you need, ending with)
> "Structure promotes consistency.
> Consistency promotes usability.
>
Good stuff! I especially like the size/training of group dimension as a way of determining structure restrictions.
But it's interesting to note that the evaluation perspective comes from the "structure/reuse" side of the equation. The other end of the spectrum is "collaboration". On that dimension, a small group of trained writers can easily work with a very rigid, structured format--especially if they can show each other what to do when the structure seems to limit what they can do. To the degree that training is required to use the more rigid format, it tends to impede collaboration by remote, part-time participants. From that perspective, a looser format is more desirable! (It could be that there is an optimum middle ground, or it could be that the tension between open document collaboration and modular, reusable structures is essentially unresolvable.)

Posted by Eric Armstrong on February 13, 2008 at 09:53 PM PST #

If only I had the time to describe all I learned and all I was able to do!
Maybe I can briefly whet your appetite by saying that I developed a .sty file (or "package") for each content use -- such as manuals, reference volumes, marketing brochures, application notes... You can view such documentation if you register at www.aplac.com to use the APLAC Solutions downloads site. I could then use the same content .tex files for multiple documents. Some would include more or less content, depending on the way the content was defined.
Each .sty file included definitions/macros, and could also call on other .sty files, to optimize modularity. I developed my own .sty files, but I also used .sty files written by others in the LaTeX community--and pdfLaTeX was compatible with the same .sty files. LaTeX is addictive when you get into it, because of its powerful versatility! The LaTeX article in Wikipedia has a number of excellent leads/links.
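
To give a flavor of the mechanism (a sketch only--the product, macro, and file names here are invented, not from the APLAC packages):

```latex
% manual.sty -- full-detail definitions, loaded by the manual
\newcommand{\productname}{WidgetSim}
\newcommand{\detail}[1]{#1}   % manuals keep the fine print

% brochure.sty defines the same macro differently:
% \newcommand{\detail}[1]{}   % brochures silently drop it

% intro.tex -- one shared content file, \input by both documents
\productname{} is a circuit simulator\detail{, with scripting
support and batch-mode analysis}.
```

The same content .tex file then renders with more or less detail depending on which .sty the document loads.
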
Happy Valentine's Day!

Posted by Elisabeth Baker on February 13, 2008 at 10:43 PM PST #

Eric,

Very thought-provoking article. You raise some interesting questions about structured markup and the problems with the tools that make the user experience with structured markup languages less than optimal. I'd welcome your thoughts on my response, posted on my blog.

Posted by Jim Earley on April 14, 2008 at 03:30 PM PDT #

Thought-provoking article and an interesting chain of responses.

My concern is that I can't figure out who the audience is for the kind of system you're describing.

Seems to me that anyone who can deal with complex techniques like transclusion, reuse, versioning, and conditional metadata will find all of these techniques easier to use with structured content. And anyone who can't deal with structured content probably can't or won't deal with these complex techniques anyway.

Trying to implement these methods without structure is a bit like trying to write an object-oriented program in assembly language because you don't want to learn Java or C++. Sure, you can do it, but if you understand and can work with object-oriented concepts, learning the most appropriate language is not a big deal, and will be a big win in the long run.

In practice, I think you can get away with really simple transclusion and reuse without serious structure (Wikipedia is a clear example of that). But I think it's unrealistic to try to do complex transclusion, reuse, etc. without structure.

Posted by Dick Hamilton on April 15, 2008 at 05:08 PM PDT #

As an interesting follow-up to this post, I just came back from the DITA CMS conference, where I found out about a WYSIWYG Wiki that produces books--and which is reported to have the features that enable serious reuse (transclusion and conditionals). That system is Daisy. I blogged about it here: http://blogs.sun.com/coolstuff/entry/daisy_wysiwyg_wiki_for_pdf

Posted by Eric Armstrong on April 17, 2008 at 02:08 AM PDT #

Jim: In response to your Sun, 13 Apr 2008 blog reply at http://jims-thoughtspot.blogspot.com/, here are my thoughts:

1. Way good stuff. Well thought, nicely stated.

2. Lack of interoperability among Wikis is a really good point.

3. Again, though, there is a tension between collaboration and structure. If you only think of /writers/, structure gives great returns for a small investment. But when you want active collaboration with engineers and users, it's an impediment if it takes the equivalent of a college course to use the system.

4. Semantic markup definitely adds capabilities you don't get with presentational markup. A post on the subject is here: http://blogs.sun.com/coolstuff/entry/value_of_semantic_tags

5. A minor correction: I have yet to find an HTML editor that didn't require me to go into the tags at various times to get the results I really wanted. The same is true for DITA. But OpenOffice, Frame, and Word are different. I've never /had/ to go into the markup. (Although I occasionally found it useful to automate operations by processing MIF, it was never a necessity for interactive editing.)

6. So there is something about XML markup languages, or their editors, that keeps the tags from being transparent. Maybe it's just early in the development curve? Not sure.

7. I love this: "It isn't that markup languages themselves are difficult. Rather, it's that the tools that we use to create the underlying markup are perhaps too difficult for authors to use." Bang! Index that under "Nail, hitting of, on the head, even."

8. Another really good point: transclusions and conditionals as the path to dynamic delivery of highly customized content. (From the supplier's perspective, it's "customized". From the user's perspective, it's truly /useful/.)

Again, great post, and a tremendous addition to the discussion, all centered on the idea of a collaborative journey towards the "truth", whatever it may happen to be.

There is clearly an optimal balance between structure and ease of collaboration, as well as a balance between redundancy and transclusion. For example, I could transclude the phrase "as well as" to keep from repeating myself, but it's pretty clear that the gain is not worth the cost.

So if there is a balance to be achieved, there must be some way of characterizing it. Once we've done that, we have a razor that tells us whether we would rather have redundancy or transcludability.

And if we can get to the point that the tools are so good that the tags remain invisible, structure becomes a no-brainer. Otherwise, there is a balance to be struck there, as well.

Posted by Eric Armstrong on April 17, 2008 at 02:34 AM PDT #
