sed and white space removal
By jmccabe on Aug 02, 2007
Several months ago I wrote about an NSAPI Output filter that could be used to strip extra white space from HTML. At the time I also commented that with the release of Web Server 7.0 there would surely be a better (or at least easier) way to do the same task.
After much procrastination I've decided to see if I can abuse the now built-in Output sed filter of Web Server 7.0 to do the same thing. Here's what I've got so far.
Output fn="insert-filter" type="text/html" filter="sed-response" sed="s/[ ][ ][ ]*/ /g" sed="s/^[ ]//g"
The bracketed characters within the two sed expressions are [SpaceTab]: each pair of brackets holds one space followed by one tab.
The first expression catches two or more spaces or tabs in a row (e.g. SpaceSpaceTab, TabSpace, TabTabTab, etc.) and replaces the run with a single space. The second expression catches a single space or tab at the beginning of a line and removes it, since the first expression frequently leaves behind lines that begin with a single space.
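To try the two expressions outside the server, here is a minimal Python sketch that approximates them line by line. It is not how the sed-response filter is implemented; the function name `strip_extra_whitespace` and the sample markup are made up for illustration.

```python
import re

def strip_extra_whitespace(html: str) -> str:
    """Approximate the two sed expressions, one line at a time (hypothetical helper)."""
    out = []
    for line in html.split("\n"):
        # Like s/[SpaceTab][SpaceTab][SpaceTab]*/ /g:
        # a run of two or more spaces/tabs becomes one space.
        line = re.sub(r"[ \t][ \t][ \t]*", " ", line)
        # Like s/^[SpaceTab]//: drop a single leading space or tab.
        line = re.sub(r"^[ \t]", "", line)
        out.append(line)
    return "\n".join(out)

if __name__ == "__main__":
    sample = "<ul>\n\t\t<li>one   two</li>\n  </ul>"
    print(strip_extra_whitespace(sample))
```

The second substitution is what cleans up the single leading space that the first one leaves behind on indented lines.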
Generally I'm pleased with the solution. My only complaint is that, unlike the NSAPI Filter I linked to previously, this solution does not take into account whether the spaces are actually relevant for presentation (if they live inside a <pre> tag, a textarea, etc.). This isn't a problem unless I want to do an approximation of ASCII art, or cut/paste source code and have it end up readable on the final HTML page.
Still ... The test file Igor provided went from 674 bytes to 529, and with GZip compression on top of that we're down to 291, so that's kinda cool.
Anyone have some thoughts on how to mangle the expression to leave <pre>, <tt> and others alone?
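One way to think about the problem, outside of sed, is to mask the protected elements first, collapse white space in what remains, then put the protected elements back. The Python sketch below illustrates that idea only; it is not a sed expression and not something the sed-response filter is known to support. The function name, the tag list, and the NUL-byte placeholder are all assumptions, and it does not handle nested or malformed tags.

```python
import re

# Assumed list of elements whose white space matters; adjust as needed.
_PROTECTED = re.compile(
    r"<(pre|textarea|tt)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)

def strip_outside_protected(html: str) -> str:
    """Collapse white space everywhere except inside protected elements."""
    saved = []

    def stash(match):
        # Remember the element verbatim and leave a numbered placeholder.
        # Assumes the input never contains NUL bytes.
        saved.append(match.group(0))
        return "\x00%d\x00" % (len(saved) - 1)

    masked = _PROTECTED.sub(stash, html)
    # Same two steps as the sed expressions, applied to the masked text.
    masked = re.sub(r"[ \t]{2,}", " ", masked)
    masked = re.sub(r"(?m)^[ \t]", "", masked)
    # Restore the protected elements untouched.
    return re.sub(r"\x00(\d+)\x00", lambda m: saved[int(m.group(1))], masked)

if __name__ == "__main__":
    page = "<p>a   b</p>\n  <pre>x   y\n   z</pre>\n\t\t<p>c</p>"
    print(strip_outside_protected(page))
```

The masking step needs to see a whole element at once, including any line breaks inside it, which is part of what makes this awkward to express as a single line-oriented substitution.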
I have to admit that I'm finding myself being all grumpy as I rediscover the limitations of sed. I would have preferred a more compact Regular Expression solution, ala VI:
Alas, this is not to be.