
Detect HTML or JavaScript in Content with C# and/or ASP.NET

ASP.NET has the ability to detect potentially dangerous input from the client. I'd like to use this ability for a different purpose. I have a search engine that pulls content from our database. Sometimes the content is in HTML.

I'd like to detect whether it is in HTML and then optionally just not display the content, because it looks like gobbledygook to the user.

I'm aware that I can use regex to try to detect this. I was hoping that, since ASP.NET is good at detecting such content, there would be a method somewhere I can reuse.

What I'm doing now is just HtmlEncoding all of the output from fields known to have HTML (or that can possibly contain it). However, as stated above, I'd like to avoid showing the user encoded HTML because it's not useful. Instead I'd just not show the content.

Summary:

  1. Detect whether content from a database contains HTML.
  2. If it does, just don't display it to the user.
  3. Bonus points if there is a suggestion to convert an HTML fragment into plaintext.
    • something like this: http://www.codeproject.com/KB/HTML/HTML_to_Plain_Text.aspx


If you want to strip out any HTML or JavaScript, I would recommend looking at this sanitize HTML function created by Jeff Atwood:

http://refactormycode.com/codes/333-sanitize-html

It is probably not a complete solution to what you need, but it would be a good place to start.


You can do something like this with jQuery. Given a string, you can add it to an element either as HTML or as text:

var str = '<a href="/path">Link</a>';
$('div').html(str);

will output:

Link

but this:

var str = '<a href="/path">Link</a>';
$('div').text(str);

will output:

<a href="/path">Link</a>


If you control the HTML generated and stored in the database, you could add a bit field to the table and, upon insert, set it to 1 or 0 depending on whether the content is HTML. To figure out whether it is HTML, you could search for several different tags with the String.Contains method until you find one. Here is a list of common html tags.

Update: I would leave off the trailing angle bracket and search for tags like so: <span <div <html, etc.
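The tag-prefix search described above might look roughly like the following. This is only a minimal sketch, not a complete detector: the class name HtmlDetector, the method name LooksLikeHtml, and the short prefix list are all hypothetical and should be adapted to the tags that actually occur in your data. It uses String.IndexOf with StringComparison.OrdinalIgnoreCase rather than String.Contains so that upper-case tags such as <DIV are matched too.

```csharp
using System;
using System.Linq;

public static class HtmlDetector
{
    // Hypothetical, deliberately short list of opening-tag prefixes.
    // Most entries omit the trailing '>' so that tags with attributes
    // (e.g. <div class="x">) are still matched.
    private static readonly string[] TagPrefixes =
    {
        "<html", "<body", "<div", "<span", "<table",
        "<a ", "<p>", "<p ", "<br", "<img", "<script"
    };

    // Returns true if the content contains any of the known tag prefixes.
    // This is a heuristic: it can miss tags that are not in the list and
    // can flag plain text that happens to contain one of the prefixes.
    public static bool LooksLikeHtml(string content)
    {
        if (string.IsNullOrEmpty(content))
        {
            return false;
        }

        return TagPrefixes.Any(prefix =>
            content.IndexOf(prefix, StringComparison.OrdinalIgnoreCase) >= 0);
    }
}
```

The result could be computed once at insert time and stored in the bit field, or checked at display time to decide whether to skip a search result's content.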

Update: you could run your HTML through lynx to convert it to plain text for display.
