Yes, something very similar to that, thanks!
The cases that are proving troublesome are similar to the following:
<font size=1>foo</font><font size=5>w<font>ww.go</font>ogle.com</font><font size=1>bar</font>
Here, simply stripping out the HTML would yield the string "foowww.google.combar", which is not the URL a person would readily see when presented with the rendered HTML.
I would also like to handle prefix/postfix text that has been made nearly invisible through color changes, but I realize that will be a bit more difficult.
I may have to write a document parser with heuristics that decide whether two separate tokens should be merged, based on their associated font/color attributes and on the intervening punctuation.
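To make the idea concrete, here is a rough sketch of the font-size half of such a heuristic, using Python's standard html.parser. It is only an illustration under some assumptions of mine: the names FontRunParser and visible_runs, the default size of 3, and the visibility threshold of 2 are all made up for this example, and it ignores color, CSS, relative sizes such as "+1", and punctuation entirely.

```python
from html.parser import HTMLParser

DEFAULT_SIZE = 3      # assumed default when no <font size> applies
MIN_VISIBLE_SIZE = 2  # hypothetical threshold: smaller text is treated as hidden


class FontRunParser(HTMLParser):
    """Collects text as [text, effective_font_size] runs."""

    def __init__(self):
        super().__init__()
        self.size_stack = [DEFAULT_SIZE]  # effective size of each open <font>
        self.runs = []

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            # A <font> without a usable size attribute inherits its parent's size.
            size = self.size_stack[-1]
            for name, value in attrs:
                if name == "size" and value and value.isdigit():
                    size = int(value)
            self.size_stack.append(size)

    def handle_endtag(self, tag):
        if tag == "font" and len(self.size_stack) > 1:
            self.size_stack.pop()

    def handle_data(self, data):
        size = self.size_stack[-1]
        # Merge with the previous run when the effective size is unchanged,
        # so "w" + "ww.go" + "ogle.com" becomes a single token.
        if self.runs and self.runs[-1][1] == size:
            self.runs[-1][0] += data
        else:
            self.runs.append([data, size])


def visible_runs(html, min_size=MIN_VISIBLE_SIZE):
    """Return the merged text runs whose font size is at least min_size."""
    parser = FontRunParser()
    parser.feed(html)
    parser.close()
    return [text for text, size in parser.runs if size >= min_size]


sample = ('<font size=1>foo</font><font size=5>w<font>ww.go</font>'
          'ogle.com</font><font size=1>bar</font>')
print(visible_runs(sample))  # ['www.google.com']
```

The key design choice is to track the effective size on a stack, so the nested <font> with no size attribute is merged into its parent's run instead of splitting the URL. Color-based hiding would need a similar stack for foreground and background colors plus some notion of contrast, which is where it gets harder.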