Now starring: HTML bloat
Cleaning up after Star's HTML export.
by Julian Thomas
As a volunteer for Montezuma National Wildlife Refuge, I have taken on, among other tasks, the construction and maintenance of the group's Web site. This year, some other volunteers instituted once-a-month refuge-wide bird counts, and I was asked to put the results on the Web site. I agreed, provided that the data reached me in machine-readable form, and now every month an Excel spreadsheet arrives in my email.
No problem, I said, invoking Star Office 5.1, which loaded the spreadsheet and let me look it over. Figure 1 shows the first few rows; there are no formulae, and the spreadsheet was used only as a convenience for data entry. I discovered that Star would allow me to directly export this data into HTML, so I jumped on that option. Ah! I could add a little bit of header and trailer information, and the job was going to be even easier than I had thought. Not surprisingly, the spreadsheet was rendered as an HTML table.
Unfortunately, Star insisted on loading each table cell, even the empty ones, with a great deal of formatting information. For instance, here is the first detail row of the table from the HTML file:
<TR>
<TD WIDTH=401 HEIGHT=21 ALIGN=LEFT VALIGN=BOTTOM>
<B><FONT FACE="Arial" SIZE=3>Pied-billed Grebe</FONT></B></TD>
<TD WIDTH=90 HEIGHT=21 ALIGN=LEFT VALIGN=BOTTOM SDVAL="2" SDNUM="1033;">
<FONT FACE="Arial">2</FONT></TD>
<TD WIDTH=76 HEIGHT=21 ALIGN=LEFT VALIGN=BOTTOM><FONT FACE="Arial">
<BR></FONT></TD>
<TD WIDTH=79 HEIGHT=21 ALIGN=LEFT VALIGN=BOTTOM SDVAL="2" SDNUM="1033;">
<FONT FACE="Arial">2</FONT></TD>
<TD WIDTH=79 HEIGHT=21 ALIGN=LEFT VALIGN=BOTTOM><FONT FACE="Arial">
<BR></FONT></TD>
<TD WIDTH=79 HEIGHT=21 ALIGN=LEFT VALIGN=BOTTOM><FONT FACE="Arial">
<BR></FONT></TD>
<TD WIDTH=79 HEIGHT=21 ALIGN=LEFT VALIGN=BOTTOM><FONT FACE="Arial">
<BR></FONT></TD>
<TD WIDTH=79 HEIGHT=21 ALIGN=LEFT VALIGN=BOTTOM><FONT FACE="Arial">
<BR></FONT></TD>
</TR>
Note that if there is any data, it shows up right after the "Arial">. The entire file came to almost 97KB, which is unpleasant to view without a broadband Internet connection.
To reduce the bloat, I whipped up a small Rexx program, destar.cmd.
/* destar.CMD: strip bloat from star office html;
   jt didit Apr 2000 */
call RxFuncAdd 'SysLoadFuncs','RexxUtil','SysLoadFuncs'
call SysLoadFuncs
tempout="temp$tar"
parse arg infile outfile .
if infile='' then do
   /* no arguments: drain the queue and show the usage text */
   do while queued()>0
      pull junk
   end
   say ' destar.cmd. Copies infile to outfile and removes'
   say ' bloat from star-generated html table data lines'
   say ' If only one argument, file is replaced with processed version'
   say ' syntax is DESTAR infile [outfile]'
   exit
end
/* one argument: write to a temporary file, then copy it over the input */
if outfile = '' then outfile=tempout
say ' Converting' infile 'to' outfile
c=stream(infile,'C','OPEN READ')
c=SysFileDelete(outfile)
c=stream(outfile,'C','OPEN WRITE')
do while lines(infile)>0
   lin=linein(infile)
   lin=strip(lin,'L',"09"x)   /* drop leading tabs */
   lin=strip(lin)             /* drop leading and trailing blanks */
   parse var lin first rest
   if first="<TD" then lin=fixlin(lin)
   call lineout outfile,lin
end
xx=lineout(infile)    /* close both files */
xx=lineout(outfile)
if outfile=tempout then do
   say 'Copying' outfile 'to' infile
   'copy' outfile infile
   call SysFileDelete outfile
end
exit

/* fixlin: reduce "<TD attributes>" to "<TD>" and remove the first
   <FONT ...> tag and its closing </FONT> from the line */
fixlin: PROCEDURE
parse arg td rest
stop=pos('>',rest)
work=td||'>'||substr(rest,stop+1)
z=pos('<FONT',work)
if z>0 then do
   work2=left(work,z-1)
   zz=pos('>',work,z)
   work=work2||substr(work,zz+1)
   zz=pos('</FONT>',work)
   if zz>0 then work=left(work,zz-1)||substr(work,zz+7)
end /* font handling */
return work
The result:
<TR>
<TD><B>Pied-billed Grebe</B></TD>
<TD>2</TD>
<TD><BR></TD>
<TD>2</TD>
<TD><BR></TD>
<TD><BR></TD>
<TD><BR></TD>
<TD><BR></TD>
</TR>
The file, after running the program, is down to less than 18KB.
At WarpTech, I mentioned this issue at the Sundial booth. Carla Hanzlik observed that Star was probably working very hard to reproduce the original formatting information, and suggested that I try the same thing with my brand new copy of Mesa 2. The same row, as exported to HTML by Mesa, looks like this:
<TR VALIGN=bottom ALIGN=left> <TD> </TD> <TD nowrap><FONT color=#000000><B>Pied-billed Grebe</B></FONT></TD> <TD ALIGN=right VALIGN=top nowrap><FONT color=#000000>2</FONT></TD> <TD> </TD> <TD ALIGN=right VALIGN=top nowrap><FONT color=#000000>2</FONT></TD> <TD> </TD> <TD> </TD> <TD> </TD> <TD> </TD> </TR>
That's considerably less bloated than the Star output (the output.htm file is 43KB), but it still contains essentially unnecessary formatting information. I'll probably switch to using Mesa in the future, since it loads dramatically faster than Star, but will still process the output using some Rexx code to strip it down even further.
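For the Mesa output, that extra stripping might look something like the sketch below. It is only an illustration, not code I have run against a full Mesa export: the stripline routine and the sample cell are made up for the example, and it assumes that a line may hold several tags, so it walks the whole line rather than looking only at the first word as destar.cmd does.

```rexx
/* demesa sketch: hypothetical helper, not part of destar.cmd */
sample='<TD ALIGN=right VALIGN=top nowrap><FONT color=#000000>2</FONT></TD>'
say stripline(sample)
exit

/* stripline: walk every tag on the line; drop <FONT ...> and </FONT>,
   reduce <TD ...> and <TR ...> to the bare tag, copy all else as is */
stripline: PROCEDURE
parse arg lin
out=''
do forever
   lt=pos('<',lin)
   if lt=0 then leave
   gt=pos('>',lin,lt)
   if gt=0 then leave                  /* unterminated tag: copy the rest */
   out=out||left(lin,lt-1)             /* text in front of the tag */
   tag=substr(lin,lt+1,gt-lt-1)        /* tag body without < and > */
   lin=substr(lin,gt+1)
   name=translate(word(tag,1))         /* tag name, uppercased */
   select
      when name='FONT' | name='/FONT' then nop
      when name='TD' | name='TR' then out=out||'<'||name||'>'
      otherwise out=out||'<'||tag||'>'
   end
end
return out||lin
```

Run on the sample cell, this displays <TD>2</TD>. One trade-off to be aware of: discarding ALIGN=right means the counts are no longer right-aligned in the browser, so it may be worth keeping that one attribute if the alignment matters.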

