
Now starring: HTML bloat

Cleaning up after Star's HTML export.

by Julian Thomas

As a volunteer for Montezuma National Wildlife Refuge, I have taken on the construction and maintenance of the group's Web site. This year, some other volunteers instituted monthly refuge-wide bird counts, and I was asked to put the results on the Web site. I agreed, provided that the data came to me in machine-readable form, and now an Excel spreadsheet arrives in my email every month.

No problem, I said, invoking Star Office 5.1, which loaded the spreadsheet and let me look it over. Figure 1 shows the first few rows; there are no formulae, and the spreadsheet was used only as a convenience for data entry. I discovered that Star would allow me to directly export this data into HTML, so I jumped on that option. Ah! I could add a little bit of header and trailer information, and the job was going to be even easier than I had thought. Not surprisingly, the spreadsheet was rendered as an HTML table.

Unfortunately, Star insisted on loading each table cell, even the empty ones, with a great deal of formatting information. For instance, here is the first detail line of the table from the HTML file.

 <TR>
   <TD WIDTH=401 HEIGHT=21 ALIGN=LEFT VALIGN=BOTTOM>
     <B><FONT FACE="Arial" SIZE=3>Pied-billed Grebe</FONT></B></TD>

   <TD WIDTH=90 HEIGHT=21 ALIGN=LEFT VALIGN=BOTTOM SDVAL="2" SDNUM="1033;">
     <FONT FACE="Arial">2</FONT></TD>
   <TD WIDTH=76 HEIGHT=21 ALIGN=LEFT VALIGN=BOTTOM><FONT FACE="Arial">
     <BR></FONT></TD>

   <TD WIDTH=79 HEIGHT=21 ALIGN=LEFT VALIGN=BOTTOM SDVAL="2" SDNUM="1033;">
     <FONT FACE="Arial">2</FONT></TD>
   <TD WIDTH=79 HEIGHT=21 ALIGN=LEFT VALIGN=BOTTOM><FONT FACE="Arial">
     <BR></FONT></TD>

   <TD WIDTH=79 HEIGHT=21 ALIGN=LEFT VALIGN=BOTTOM><FONT FACE="Arial">
     <BR></FONT></TD>
   <TD WIDTH=79 HEIGHT=21 ALIGN=LEFT VALIGN=BOTTOM><FONT FACE="Arial">
     <BR></FONT></TD>

   <TD WIDTH=79 HEIGHT=21 ALIGN=LEFT VALIGN=BOTTOM><FONT FACE="Arial">
     <BR></FONT></TD>
  </TR>

Note that if there is any data, it shows up right after the "Arial">. The entire file came to almost 97KB, which is unpleasant to view without a broadband Internet connection.

To reduce the bloat, I whipped up a small Rexx program, destar.cmd.

 /* destar.CMD: strip bloat from star office html;
    jt didit Apr 2000 */

 call RxFuncAdd 'SysLoadFuncs','RexxUtil' ,'SysLoadFuncs'
 call SysLoadFuncs

 tempout="temp$tar"

 parse arg infile outfile .
 if infile='' then do
   do while queued()>0
     pull junk
   end
   say ' destar.cmd. Copies infile to outfile and removes'
   say ' bloat from star-generated html table data lines'
   say ' If only one argument, file is replaced with processed version'
   say ' syntax is DESTAR infile [outfile]'
   exit
   end

 if outfile = '' then outfile=tempout

 say ' Converting' infile 'to' outfile

 c=stream(infile,'C','OPEN READ')
 c=SysFileDelete(outfile)
 c=stream(outfile,'C','OPEN WRITE')

 do forever while lines(infile)>0
   lin=linein(infile)
   lin=strip(lin,'L',"09"x)
   lin=strip(lin)
   parse var lin first rest
   if first="<TD" then lin=fixlin(lin)
   call lineout outfile,lin
   end

 xx=lineout(infile)
 xx=lineout(outfile)

 if outfile=tempout then do
   say 'Copying' outfile 'to' infile
   'copy' outfile infile 
   call SysFileDelete outfile
   end
 exit

 fixlin: PROCEDURE
   parse arg td rest 
   stop=pos('>',rest)
   work=td||'>'||substr(rest ,stop+1)
   z= pos('<FONT',work)
   if z>0 then do
     work2=left(work,z-1)
     zz=pos('>',work,z)
     work=work2||substr(work,zz+1)
     zz=pos('</FONT>',work)
     if zz>0 then work = left(work,zz-1)||substr(work,zz+7)
     end /* font handling */
   return work

The result:

  <TD><B>Pied-billed Grebe</B></TD>
  <TD>2</TD>

  <TD><BR></TD>
  <TD>2</TD>
  <TD><BR></TD>
  <TD><BR></TD>

  <TD><BR></TD>
  <TD><BR></TD>
  </TR>

The file, after running the program, is down to less than 18KB.

At WarpTech, I mentioned this issue at the Sundial booth. Carla Hanzlik observed that Star was probably working very hard to reproduce the original formatting information, and suggested that I try the same thing with my brand new copy of Mesa2. The same line, as exported to HTML by Mesa, looks like this:

  <TR VALIGN=bottom ALIGN=left>
  <TD>&#160;</TD>
  <TD nowrap><FONT color=#000000><B>Pied-billed Grebe</B></FONT></TD>

  <TD ALIGN=right VALIGN=top nowrap><FONT color=#000000>2</FONT></TD>
  <TD>&#160;</TD>
  <TD ALIGN=right VALIGN=top nowrap><FONT color=#000000>2</FONT></TD>

  <TD>&#160;</TD>
  <TD>&#160;</TD>
  <TD>&#160;</TD>
  <TD>&#160;</TD>

  </TR>

That's considerably less bloated than the Star output (the output.htm file is 43KB), but it still contains essentially unnecessary formatting information. I'll probably switch to using Mesa in the future, since it loads dramatically faster than Star, but will still process the output using some Rexx code to strip it down even further.
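What that further stripping might look like is sketched below. This is only a minimal sketch, not a tested companion to destar.cmd: the function name slim is hypothetical, and it assumes that every tag opens and closes on the same line, as in the sample above. It also assumes that all TD and TR attributes can go, which means the right alignment of the numbers is lost along with the FONT tags.

```rexx
/* slim sketch: drop FONT tags and TD/TR attributes from one line */
sample = '<TD ALIGN=right VALIGN=top nowrap><FONT color=#000000>2</FONT></TD>'
say slim(sample)                   /* shows <TD>2</TD> */
exit

slim: procedure
  parse arg lin
  out = ''
  do while lin \== ''
    /* split off the text before the next tag, then the tag itself */
    parse var lin text '<' tag '>' lin
    out = out || text
    if tag == '' then iterate      /* no tag found in this piece */
    name = translate(word(tag, 1)) /* tag name in upper case */
    select
      when name = 'FONT' | name = '/FONT' then nop  /* drop the tag */
      when name = 'TD' | name = 'TR' then out = out || '<' || name || '>'
      otherwise out = out || '<' || tag || '>'      /* keep as is */
    end
  end
  return out
```

Called on each line of the Mesa output in the same kind of read loop that destar.cmd uses, this would turn the name cell into <TD><B>Pied-billed Grebe</B></TD> and leave the empty <TD>&#160;</TD> cells untouched.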
