Friday, January 14, 2011

Damn TCLLIB htmlparse

I recently needed to write a quick script to extract some information from HTML documents. Poking around, I found the machine already had TCLLIB installed, so I thought it would be a good opportunity to try out the htmlparse library. The library is extremely easy to use except for one annoying pitfall: the handling of attributes. The parser invokes a callback function with the tag name, a slash marker for closing tags, the attributes, and the text that follows the tag. A default callback, ::htmlparse::debugCallback, is provided. From the documentation of the param argument:
The fourth argument, param, contains the un-interpreted list of parameters to the tag.
I thought for sure they must be kidding. Do they really expect me to parse the attribute data myself? The reason I'm using a library is so I don't have to worry about all the intricacies of HTML parsing. I decided to give it a try with the example below:
#!/usr/bin/env tclsh

package require htmlparse 1.2

set html {
<html>
  <head><title>Test HTML Page</title></head>
  <body>
    <p>This is some test <a target = "_blank"
    href="http://w3.org/html"         >HTML</a> content.</p>
  </body>
</html>
}

::htmlparse::parse $html
The output of running this example:
$ ./htmlparse.tcl 
==> hmstart {} {} {
}
==> html {} {} {
  }
==> head {} {} {}
==> title {} {} {Test HTML Page}
==> title / {} {}
==> head / {} {
  }
==> body {} {} {
    }
==> p {} {} {This is some test }
==> a {} {target = "_blank"
    href="http://w3.org/html"         } HTML
==> a / {} { content.}
==> p / {} {
  }
==> body / {} {
}
==> html / {} {
}
==> hmstart / {} {}
Sure enough, the attributes are all in one big string just as the documentation stated. This is one of those times I was hoping the documentation was wrong.
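
For what it's worth, splitting the string by hand does not take much code. Below is a minimal sketch, assuming Tcl 8.5 or later for dict. The names parseAttrs and showLinks are my own and are not part of htmlparse, and the regular expression only covers the common cases (double-quoted, single-quoted, and bare values, plus valueless attributes), so treat it as a starting point rather than a complete HTML attribute parser:

```tcl
# Hypothetical helper (not part of htmlparse): turn the raw attribute
# string passed to the callback into a dict of name -> value.
proc parseAttrs {param} {
    set attrs [dict create]
    # A name, optionally followed by "=" and a double-quoted,
    # single-quoted, or bare value.
    set re {([^\s=/>]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+)))?}
    foreach {all name dq sq bare} [regexp -all -inline -- $re $param] {
        # At most one of the three value groups is non-empty.
        dict set attrs [string tolower $name] "$dq$sq$bare"
    }
    return $attrs
}

# Hypothetical callback taking the same four arguments that
# debugCallback printed above: tag, slash, param, and trailing text.
proc showLinks {tag slash param text} {
    if {[string tolower $tag] eq "a" && $slash eq ""} {
        set attrs [parseAttrs $param]
        if {[dict exists $attrs href]} {
            puts "link: [dict get $attrs href]"
        }
    }
}
```

To try it, add the two procs to the script above and change the last line to ::htmlparse::parse -cmd showLinks $html so the parser calls showLinks instead of the default debug callback.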
