C# Quickly remove unsupported HTML tags from forms

While building some forms for a web site, I realized that some users were entering HTML tags in the forms that I wasn’t happy about. I didn’t mind some tags, since they are harmless and give users who know HTML some ability to alter presentation. So here is what I did in C# to clean up and remove HTML tags. I am sure there are many ways to improve it (so feel free to).


using System.Text.RegularExpressions;

// This method removes *most* malicious markup while leaving the
// allowed HTML intact.
public static string stripHTMLTags(string input)
{
    string output = "";
    // Break the comments so someone cannot add an open comment.
    input = input.Replace("<!--", "");

    // Strip out the doctype declaration. [^>]* consumes everything up
    // to the closing bracket ([.]* would only match literal dots).
    Regex docType = new Regex("<!DOCTYPE[^>]*>", RegexOptions.IgnoreCase);
    output = docType.Replace(input, "");

    // add target="_blank" to hrefs and remove parts that are 
    // not supported
    output = Regex.Replace(output, "(.*)", @"$5");

    // Strip out most known tags except a, b, br, blockquote, em, h1-h6,
    // hr, i, li, ol, p, u, ul, strong, sub and sup. The \b keeps short
    // names such as "s" from also matching <strong>, <sub> or <sup>.
    Regex badTags = new Regex(
        "</?(abbr|acronym|address|applet|area|base|basefont|bdo|big|body"
        + "|button|caption|center|cite|code|col|colgroup|dd|del|dir|div"
        + "|dfn|dl|dt|embed|fieldset|font|form|frame|frameset|head|html"
        + "|iframe|img|input|ins|isindex|kbd|label|legend|link|map|menu"
        + "|meta|noframes|noscript|object|optgroup|option|param|pre|q|s"
        + "|samp|script|select|small|span|strike|style|table|tbody|td"
        + "|textarea|tfoot|th|thead|title|tr|tt|var|xmp)\\b[^>]*>",
        RegexOptions.IgnoreCase);
    return badTags.Replace(output, "");
}
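
The step that adds target="_blank" to hrefs can also be sketched on its own. The version below is an assumption, not the pattern used in the method above: the class name LinkSketch, the method NormalizeLinks, the regular expression, and the rule of keeping only http/https URLs are all hypothetical choices made for illustration. It rebuilds each link from just its URL and text, so any other attributes (onclick and the like) are dropped.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkSketch
{
    // Hypothetical pattern: an <a> tag with a quoted href.
    // Group 1 = the quote character, group 2 = the URL, group 3 = the link text.
    private static readonly Regex Anchor = new Regex(
        "<a\\s[^>]*?href\\s*=\\s*([\"'])(.*?)\\1[^>]*>(.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string NormalizeLinks(string input)
    {
        return Anchor.Replace(input, m =>
        {
            string url = m.Groups[2].Value.Trim();
            string text = m.Groups[3].Value;

            // Assumption: only plain http/https links are supported.
            bool isHttp =
                url.StartsWith("http://", StringComparison.OrdinalIgnoreCase) ||
                url.StartsWith("https://", StringComparison.OrdinalIgnoreCase);

            // Reject URLs that could break out of the rebuilt attribute.
            bool hasUnsafeChars =
                url.IndexOfAny(new[] { '"', '\'', '<', '>' }) >= 0;

            if (!isHttp || hasUnsafeChars)
            {
                return text; // keep the link text, drop the link itself
            }

            // Rebuild the tag with only href and target.
            return "<a href=\"" + url + "\" target=\"_blank\">" + text + "</a>";
        });
    }
}
```

Calling LinkSketch.NormalizeLinks on a link that carries an onclick attribute returns the same link with only href and target="_blank", while a javascript: link is reduced to its text.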

Here are a couple of web sites that I used as a reference:
Regular Expression Reference – http://www.regular-expressions.info/reference.html
A somewhat comprehensive list of HTML tags – http://www.w3schools.com/tags/default.asp

 


Responses to “C# Quickly remove unsupported HTML tags from forms”

  1. Mark says :

    Hi there.

    Many thanks for this. It’s a lifesaver.

    Just one thing:

    I’ve found I had to insert some forward slashes into the ‘hrefs’ line, before the embedded quotes and embedded forward slash.

  2. Emilio says :

    You forgot to specify RegexOptions.IgnoreCase because it could be that the HTML in the input string is not lowercase.
