Q: Is the CMS itself going to be tidied on the fly, or is it the CMS *content* that is to be tidied?
Good question. It's just the content that's being tidied at the moment (although eventually, when online editing of the CMS code goes live, we'll obviously want to tidy that too, purely to make it easier to edit! At least I, personally, am not going to edit raw HTML templates without some tidying).
So, you're right: the textarea bug is not an issue right now. Gah. Stressed, tired. I'm making stupid mistakes!
Anyway, FYI, tidying is currently only necessary for these use-cases:
1. User who tries to insert malicious HTML code; need to prune or nullify nasty tags
2. User who tries to insert undesirable but non-malicious HTML code; use of IMG or A HREF tags in places where the only possible reason for them is to promote spam etc
...in both these cases, the tidying can be intensive. In both these cases, the check is necessary "once only", but the output will be read thousands if not millions of times.
Hence, check *must* be on the server, and needs to be prior to storing in DB. This assumes that pages are read a lot more often than they are modified (which seems fair)
3. Nice user who is crap at writing HTML
4. Nice user who copies/pastes HTML from a Word document
...only difference is that it would be "safe" to do the check on the client browser, since in these cases the user does not want to circumvent it
...but since the server-side check has to exist anyway for the nasty users, we're back to cases 1 and 2 again, as far as "where" the tidying can be done
So, tidying needs to be:
- at form-submit time
- good at removing nasties
At the moment, we have a custom set of filters written using regexps (because they're so damn good at filtering nasty HTML; it's very easy to cover all the random deliberate attempts to fool HTML parsers WITHOUT having to construct full HTML parse trees), and we're attempting to run Tidy as a final-stage process that does most of the "coercing into well-formed HTML, using only modern tags, and undoing any Word/etc nastiness".
Although Tidy does the full tree parse I don't trust the security of the HTML to it. Yet. Maybe I will in the future; at any rate, the regexp security parsing was added before starting with Tidy, so we might as well keep it for now.
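
For anyone who wants to picture the two-stage setup, here's a rough Python sketch of the idea: regexp filters first, then a tidy step, all run once at form-submit time before anything goes into the DB. To be clear, this is NOT our actual filter set. The patterns, the names (NASTY_PATTERNS, strip_nasties, clean_on_submit) and the pluggable tidy argument are all made up for illustration, and a list this short is nowhere near enough to be safe on its own.

```python
import re

_FLAGS = re.IGNORECASE | re.DOTALL

# Hypothetical, deliberately incomplete filter list (illustration only).
NASTY_PATTERNS = [
    # Whole blocks such as <script>...</script>, contents included.
    re.compile(r"<\s*(script|style|iframe|object|embed)\b.*?<\s*/\s*\1\s*>", _FLAGS),
    # Any leftover unpaired opening or closing tag of the same kinds.
    re.compile(r"<\s*/?\s*(script|style|iframe|object|embed)\b[^>]*>", _FLAGS),
    # Inline event handlers such as onclick="..." (crude: can also hit plain text).
    re.compile(r"""\son\w+\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)""", _FLAGS),
    # javascript: URLs in href/src attributes.
    re.compile(r"""(href|src)\s*=\s*(["']?)\s*javascript:[^"'\s>]*\2""", _FLAGS),
]


def strip_nasties(html: str) -> str:
    """Apply every filter repeatedly until the text stops changing.

    Repeating matters because removing one match can splice the
    surrounding text into a new nasty tag.
    """
    previous = None
    while previous != html:
        previous = html
        for pattern in NASTY_PATTERNS:
            html = pattern.sub("", html)
    return html


def clean_on_submit(raw_html: str, tidy=lambda html: html) -> str:
    """Run once at form-submit time, before storing in the DB.

    `tidy` stands in for the final Tidy stage; it defaults to a no-op
    so the sketch runs without Tidy installed.
    """
    return tidy(strip_nasties(raw_html))
```

The point of the shape is that the expensive work happens once per write, and every later page read just serves the already-cleaned HTML. Swap the no-op for whatever binding you use to call Tidy.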