/* OVERVIEW:
*
* Washtml takes untrusted HTML and returns a safe HTML string.
*
* SYNOPSIS:
*
* washtml($html, $config, $full);
* It returns a sanitized copy of the $html parameter, without the html and head tags.
* $html is a string containing the HTML code to wash.
* $config is an array containing options:
* $config['allow_remote'] is a boolean that allows links to remote images.
* $config['cid_map'] is an array mapping cid URLs to the URLs that replace them.
* $config['charset'] is a string containing the charset of the HTML document, used if the document does not define one.
* $full is a reference to a boolean that is set to true if no remote images are removed. (FE: show remote images link)
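*
* EXAMPLE (a minimal sketch of a call following the synopsis above; the
* $config values and the cid/URL pair are illustrative assumptions, not
* defaults):
*
* $config = array(
*     'allow_remote' => false,
*     'cid_map'      => array('cid:logo' => 'http://yourhost/attachments/logo.png'),
*     'charset'      => 'UTF-8',
* );
* $full = true;
* $safe = washtml($html, $config, $full);
* if (!$full) {
*     // At least one remote image was removed: offer a "show remote images" link.
* }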
*
* INTERNALS:
*
* Only tags and attributes listed in the globals $html_elements and
* $html_attributes are kept. Inline styles are also filtered: all style
* identifiers matching /[a-z\-]/i are allowed. Values matching colors, sizes,
* or /[a-z\-]/i are kept, as are safe URLs if allowed and cid URLs if mapped.
*
* BUGS: It *MUST* be safe!
* - Check the regexps
* - urlencode URLs instead of applying htmlspecialchars
* - Check whether the first byte of a 3-byte UTF-8 character can eat '">'
* - Update PCRE: CVE-2007-1659 - CVE-2007-1660 - CVE-2007-1661 - CVE-2007-1662
* CVE-2007-4766 - CVE-2007-4767 - CVE-2007-4768
* http://lists.debian.org/debian-security-announce/debian-security-announce-2007/msg00177.html
* - ...
*
* MISSING:
* - relative links: they can be implemented by prefixing an absolute path.
* Ask me if you need it...
* - ...
*
* Don't be a fool:
* - Don't alter data on a GET: '<img src="http://yourhost/mail?action=delete&uid=3267" />'
* - ...
*/
Thank you.
Frederic Motte
Liazo.fr, high availability & security for applications, systems and networks