
Rails - strip_tags - Not catching DOCTYPE?

Given an HTML email, I'm using the following to strip down to just the text:

  body = body.gsub(/\\r\\n?/, "\n");
  body = body.gsub(/\\n\\n?/, "\n");
  body = simple_format(body)
  body = strip_tags(body)

But I'm now seeing that one tag gets past this:

<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">

Which outputs like so:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

Any ideas why?


My guess is that strip_tags, which looks like it's been deprecated, considers the doctype declaration neither a tag nor a comment. You could strip it out separately:

string.gsub(/<!.*?$/,'')
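
One caveat with that pattern: it runs from `<!` to the end of the line, so anything after the declaration on the same line is removed too, and HTML comments match as well. Below is a minimal sketch of a narrower variant that removes only the DOCTYPE declaration itself. The `strip_doctype` helper and the `DOCTYPE_PATTERN` constant are names made up for this example, not part of Rails, and the pattern assumes the declaration contains no `>` before its closing one (so it would not handle a DOCTYPE with an internal subset).

```ruby
# Hypothetical helper (not a Rails API): drop DOCTYPE declarations
# before passing the string on to strip_tags.
# [^>]* also matches newlines, so a declaration split across lines is covered.
DOCTYPE_PATTERN = /<!DOCTYPE[^>]*>/i

def strip_doctype(html)
  html.gsub(DOCTYPE_PATTERN, '')
end

html = %(<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><p>Hello</p>)
puts strip_doctype(html)  # prints "<p>Hello</p>"
```

In the original snippet this would slot in just before the `strip_tags(body)` call, e.g. `body = strip_doctype(body)`.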


I ended up using Hpricot to convert the HTML to text, and it worked great.


I'd recommend using Nokogiri for your parsing needs. It's very well supported, plenty fast, very flexible, and the basis of a lot of other HTML/XML gems. It has an Hpricot mode, though I'm not sure why anyone would need that, as Nokogiri's own syntax is more full-featured.

In particular, to strip tags from HTML, I'd recommend looking into Loofah. It can whitelist tags, and has several layers of cleansing it can do.
