Remove whitespace from an HTML document using Ruby

I have a string in Ruby that looks something like this:

str = "<html>\n<head>\n\n  <title>My Page</title>\n\n\n</head>\n\n<body>" +
      "  <h1>My Page</h1>\n\n<div id=\"pageContent\">\n  <p>Here is a para" +
      "graph. It can contain  spaces that should not be removed.\n\nBut\n" +
      "line breaks that should be removed.</p></body></html>"

How would I remove all whitespace (spaces, tabs, and line breaks) that sits outside the tags, i.e. not inside a tag that has text content such as <p>, using only native Ruby?

(I'd like to avoid using XSLT or something for a task this simple.)


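str = str.gsub(/[\n\t]/, " ").gsub(/>\s*</, "><")

The first gsub replaces every line break and tab with a space (the character class [\n\t] matches either one); the second removes whitespace between tags. The non-bang gsub is used because gsub! returns nil when it makes no replacement, which would break the chain.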

You will end up with multiple spaces inside the text content of your tags, but if you simply removed every \n and \t you would get something like "not be removed.Butline breaks", which is not very readable. Another regular expression, or the .squeeze(" ") mentioned in another answer, could take care of the extra spaces.
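
A minimal sketch of this approach with the cleanup step added. The method name collapse_html_whitespace is made up for illustration, and including "\r" in the character class is an assumption about Windows-style input. Note that squeeze(" ") collapses every run of spaces, including runs inside <p> that the question wanted to keep, so drop that step if those matter.

```ruby
# Hypothetical helper; the name is not from the original answer.
def collapse_html_whitespace(html)
  html
    .gsub(/[\r\n\t]/, " ")  # each line break or tab becomes one space
    .gsub(/>\s*</, "><")    # whitespace directly between two tags is dropped
    .squeeze(" ")           # runs of spaces shrink to a single space
end

sample = "<body>\n  <p>not be removed.\n\nBut\nline breaks</p>\n</body>"
puts collapse_html_whitespace(sample)
# prints: <body><p>not be removed. But line breaks</p></body>
```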


Hate to split hairs about regexen, but none of the other answers are strictly correct. This will work:

str.gsub(/>\s*/, ">").gsub(/\s*</, "<")

Explicitly converting newlines is unnecessary because /\s/ matches every whitespace character, including newline. The regexen in the other answers are not strictly correct because they fail to match "\r", which is part of the "\r\n" line ending used on Windows and which also shows up in emails.

My line will also convert <p> foo bar </p> into <p>foo bar</p>, but you may not want this.
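
To make the trade-off concrete, here is a small sketch that runs the two substitutions on a few inputs. The helper name strip_around_tags and the sample strings are invented for the example. The last case shows a side effect worth checking before using this on real pages: spaces next to inline tags disappear too.

```ruby
# Hypothetical helper wrapping the two substitutions from this answer.
def strip_around_tags(html)
  html.gsub(/>\s*/, ">").gsub(/\s*</, "<")
end

# \s covers "\r", so Windows line endings between tags are removed.
puts strip_around_tags("<ul>\r\n  <li>one</li>\r\n</ul>")
# prints: <ul><li>one</li></ul>

# Leading and trailing whitespace inside a tag is trimmed as well.
puts strip_around_tags("<p> foo bar </p>")
# prints: <p>foo bar</p>

# Caveat: the spaces around inline tags are lost, which changes the text.
puts strip_around_tags("<p>foo <b>bar</b> baz</p>")
# prints: <p>foo<b>bar</b>baz</p>
```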


You can condense all groups of space characters into one space (i.e., "hello     world" into "hello world") by using String#squeeze:

"hello     world".squeeze(" ")  # => "hello world"

The argument to squeeze is the character to be squeezed.
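
A few more cases may help show what squeeze does and does not cover. The inputs are illustrative rather than taken from the question, and the gsub line is just one possible alternative when the whitespace is mixed.

```ruby
# With an argument, only that character is squeezed.
spaces_only = "a  b\t\tc".squeeze(" ")   # tabs are untouched
# Without an argument, every repeated character is squeezed.
everything = "hello   world".squeeze     # the double "l" collapses too
# A regex handles any mix of spaces, tabs, and line breaks.
mixed = "a \t\n b".gsub(/\s+/, " ")

p spaces_only  # => "a b\t\tc"
p everything   # => "helo world"
p mixed        # => "a b"
```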

EDIT: I misread your question, sorry.

Using squeeze this way would

  • remove consecutive spaces within tags
  • leave individual spaces outside tags

I'll work on a solution right now.


xml.squish.gsub(/(> <)/, '><')

Even shorter than above.

PS I love the funny faces.
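
Note that squish is provided by ActiveSupport (Rails) rather than core Ruby, so on its own it does not meet the "native Ruby" constraint in the question. A rough plain-Ruby sketch of the same idea follows; squish_html is a made-up name, and the strip/gsub pair only approximates what squish does.

```ruby
# Hypothetical stand-in for ActiveSupport's String#squish plus the gsub above.
def squish_html(xml)
  squished = xml.strip.gsub(/\s+/, " ")  # trim the ends, collapse whitespace runs
  squished.gsub(/(> <)/, "><")           # remove the single space left between tags
end

puts squish_html("  <ul>\n  <li>one  two</li>\n</ul>\n")
# prints: <ul><li>one two</li></ul>
```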
