Attempting to tame StarOffice HTML
By Chris Quenelle on Aug 28, 2006
One of the tasks I have to do every so often is to maintain a 75-page HTML document. It's a manual for an API that I'm responsible for. Over the years, I've used Netscape, Mozilla, or Star Office to update the manual. It seems every time I edit it (usually every 3 to 6 months), the output when I save it is completely different in terms of the indenting and the nature and amount of extra cruft that gets put in. I've always accepted this as the way things have to be, because I can understand the difficulty in mapping between a text editor's notion of rich text and HTML's notion of rich text.
What pushed me over the edge is the problem of adding a new function to my library and adding a new paragraph of documentation at the same time. I want to be able to diff the two HTML files (before and after the addition) and see what changes were made. I don't mind if it's got markup in it, I just want a reminder of what change was made, so a plain text diff is okay with me.
But every time I try this, I get a diff on just about every line of the entire file. The line breaks are completely different, or new style properties get added to every paragraph even though I didn't change the actual styles in the editor. Bummer.
Of course, if I turn around and make another change immediately, I'll get a reasonable diff. The problem happens because I only update the docs every 3-6 months. In between, it's quite common for one of these things to happen:
- I accidentally use a different version of Star Office
- Someone else updates the manual using some unknown editor
- I upgrade my desktop machine and get a different micro version of Star Office
I was thinking about this recently, and decided to try playing around with Star Office to see if I could get it to generate a very simple and stripped down HTML.
Part of that idea was trying to convince it to use CSS, and to reload the CSS from the HTML file and recreate its own styles. I considered this a pretty far-fetched idea, because I assumed the internal Star Office to HTML translation was basically one way. I assumed all the higher level structures (like predefined styles) would be lowered to HTML and forgotten.
It turns out I was pleasantly surprised. Star Office 8 can basically do what I want. It still has a little cruft it wants to inject to enforce its own "default" settings. Specifically, <P CLASS="myformat-western" STYLE="margin-bottom: 0.2in"> and <P STYLE="margin-bottom: 0in; page-break-before: auto"> seem to be common things that sneak into your HTML source.
When you create a new style, if you want it to be stored in the CSS you have to "link" it with an existing style that represents an HTML tag. For example, a paragraph format could be linked with "Text body". Linking your new style with "Default" (the default) won't get it saved as CSS.
You can use the "Styles and Formatting" dialog to change between paragraph and character styles (a standard feature in word processors) but there doesn't seem to be a way to use that dialog to remove formats. You have to use the menu bar under Format->Default Formatting. I think it's very easy for Star Office to accidentally translate a style-oriented change to a paragraph into an individual-paragraph-oriented change. In other words, when I change the margin on a paragraph style to 0.5 inches (for example), sometimes Star Office will change the style, but also assign that specific margin to each paragraph instance that is tagged with the style. This adds to the number of individual crufty style blobs floating around in your document.
In order to get rid of all the margin-bottom settings, I added an explicit margin-bottom setting to my own custom styles, and then decided to use my own custom styles for all the most common paragraph styles in my manual.
But there was no way to control the page-break-before property in the HTML editing mode of Star Office Writer. If you are editing a Star Office native ODT document then you have some "Text Flow" options. But all those options seem to vanish from all the various menus when you're editing an HTML document. It's a little confusing because all the docs still talk about them.
But I did an experiment, where I removed all the "page-break-before" properties from the HTML file on disk, and then reloaded the file and added some more paragraphs. It seems that Star Office won't recreate them after you remove them. So I think I'm ready to try converting my document to use new CSS styles.
We'll see how it goes.
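The page-break-before experiment can be scripted instead of done by hand. What follows is only a minimal sketch in Python, not something taken from Star Office itself: the function name drop_property is made up for illustration, and it assumes inline styles always appear as double-quoted STYLE="..." attributes, as in the examples above. It uses a regular expression rather than a real HTML parser, so it would also touch any body text that happens to contain the same pattern.

```python
import re

# Matches a STYLE="..." attribute (any letter case), including the
# whitespace in front of it. Assumes double-quoted attribute values.
STYLE_ATTR = re.compile(r'\s+style="([^"]*)"', re.IGNORECASE)


def drop_property(html, prop):
    """Remove one CSS property from every inline STYLE attribute.

    Hypothetical helper for illustration only.
    """

    def rewrite(match):
        kept = []
        for decl in match.group(1).split(";"):
            name = decl.split(":", 1)[0].strip().lower()
            if decl.strip() and name != prop.lower():
                kept.append(decl.strip())
        if not kept:
            return ""  # nothing left, so drop the whole attribute
        # The attribute name is rewritten in upper case.
        return ' STYLE="' + "; ".join(kept) + '"'

    return STYLE_ATTR.sub(rewrite, html)


sample = (
    '<P STYLE="margin-bottom: 0in; page-break-before: auto">One</P>\n'
    '<P STYLE="page-break-before: auto">Two</P>\n'
)
print(drop_property(sample, "page-break-before"))
```

Other declarations in the same attribute are kept, and an attribute that ends up empty is removed entirely, so the second paragraph comes out as a bare <P>.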
Some time later ...
One important feature on your new custom styles is "autoupdate". This is important while you're fiddling around with formats. But when you save/reload your HTML it gets reset to off.
Hmmmm.... It seems that some properties (like margins) will automatically get propagated to all paragraphs with the same style. But some properties (like spacing above and below paragraph) will not.
I spent a lot of time updating a single style and then paint-brushing the same style onto all the instances that already had that style. I think I could have gotten the same effect by just removing all the floating style blobs using a text editor and then reloading.
Yes, that works better. I had "margin-xxx" properties sneak into my document somehow. If I save the doc as HTML and then delete all the explicit style settings matching "margin-*" then reload it, all the paragraphs start to respond globally to changes (you still need to mark them AutoUpdate, I assume).
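Before deleting anything, it can help to see which inline properties have actually crept into the saved file. This is a rough sketch in Python under the same assumption as the examples earlier in this post (inline styles stored as double-quoted STYLE="..." attributes); count_inline_properties is a hypothetical name, and the sample markup is invented for the demonstration.

```python
import re
from collections import Counter

# Captures the value of each STYLE="..." attribute (any letter case).
STYLE_ATTR = re.compile(r'\bstyle="([^"]*)"', re.IGNORECASE)


def count_inline_properties(html):
    """Count how often each CSS property appears in inline STYLE attributes."""
    counts = Counter()
    for body in STYLE_ATTR.findall(html):
        for decl in body.split(";"):
            name = decl.split(":", 1)[0].strip().lower()
            if name:
                counts[name] += 1
    return counts


# Invented sample; a real run would read the saved HTML file instead.
sample = (
    '<P CLASS="myformat-western" STYLE="margin-bottom: 0.2in">A</P>\n'
    '<P STYLE="margin-left: 0.5in; margin-bottom: 0in">B</P>\n'
    '<P>C</P>\n'
)
for name, n in count_inline_properties(sample).most_common():
    print(n, name)
```

Any property that shows up here with a large count is a likely candidate for the "floating style blobs" that stop paragraphs from following their style.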
Okay, I think I've got a reformatted manual that will work for me. I needed to liberally use a regular-expression capable editor to strip out all the "style" attributes that kept creeping back in. But once I take them out they seem to stay out pretty well.
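For anyone without a regular-expression capable editor handy, the same kind of cleanup can be approximated with a short script. This is a sketch only, with a made-up function name, and it assumes double-quoted attribute values and no ">" characters inside attribute values. It removes STYLE attributes from tags but leaves CLASS attributes and any <STYLE> block alone, since that block is where the CSS definitions live.

```python
import re

# Anything that looks like a tag: "<", some non-bracket text, ">".
TAG = re.compile(r"<[^<>]+>")
# A STYLE="..." attribute with the whitespace in front of it.
STYLE_ATTR = re.compile(r'\s+style="[^"]*"', re.IGNORECASE)


def strip_inline_styles(html):
    """Delete STYLE="..." attributes from tags, leaving other attributes alone."""
    return TAG.sub(lambda m: STYLE_ATTR.sub("", m.group(0)), html)


sample = (
    '<P CLASS="myformat-western" STYLE="margin-bottom: 0.2in">First</P>\n'
    '<P STYLE="margin-bottom: 0in; page-break-before: auto">Second</P>\n'
)
print(strip_inline_styles(sample))
```

Restricting the substitution to text inside tags keeps the script from touching body text that merely mentions a style attribute.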