
Regular expression to remove endline space patterns

I have a website updater that converts each p element to a textarea. The user types in the content, then each textarea is converted back to a p element, and I grab the resulting HTML and store it in my SQL database.

My problem: in Internet Explorer, the HTML I get back has been changed slightly. For example:

// From this originally
<img id="headingpic"/><div id="myContent">  

// To this
<img id="headingpic"/>
<div id="myContent">

This matters because the page now displays a vertical gap between the img and the div below it.

Sometimes IE inserts "\n ", sometimes " \n", and sometimes just "\n". I am trying to come up with a regular expression that removes these line breaks (and the surrounding spaces) whatever their pattern. I have a lot of difficulty coming up with regular expressions; they seem so cryptic to me.

If I explain my algorithm, can you suggest the characters that achieve this in a regular expression?

  • For every ">" character: ignoring any whitespace or line-break characters, if the next character is "<", then proceed
  • For every character before that "<", if it is not ">", delete it (or replace it with "")

I am trying to do this in either JavaScript or Python:

# Python: should I use replace for this? Would my regular expression look something like this?
HTML_CONTENT.replace( "^[ \t\n\r]" ) # this removes all whitespace as far as I know


I would go about this a different way:

First, split by line.

html_content_list = HTML_CONTENT.split("\n"); // Split by line;

Then remove the whitespace at both ends of each line with .trim() (assuming each entry is a one-line string; test for null first):

// Use an index loop; for...in on an array can also visit inherited properties
for (var i = 0; i < html_content_list.length; i++)
{
    html_content_list[i] = html_content_list[i].trim();
}

Then, if it really does need new lines, add them back when joining (join returns a new string, so keep the result):

HTML_CONTENT = html_content_list.join("\n");
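
For the Python side, the same split, trim, join idea might be sketched as follows. `strip_line_whitespace` is a hypothetical helper name, and defaulting the joiner to an empty string (so the line breaks disappear) is an assumption based on the question, not part of the answer above.

```python
def strip_line_whitespace(html, joiner=""):
    """Trim every line of html and rejoin the lines.

    joiner="" drops the line breaks entirely (assumed default);
    pass joiner="\n" to keep one clean line break per line.
    """
    lines = html.splitlines()
    return joiner.join(line.strip() for line in lines)


sample = '<img id="headingpic"/> \n <div id="myContent">'
print(strip_line_whitespace(sample))
# <img id="headingpic"/><div id="myContent">
```

Joining with an empty string also glues together ordinary text that was split across lines, so this only suits markup where line breaks fall between tags.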


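Your character class needs a few more characters, or you can use \s. Note that Python's str.replace does not accept regular expressions, so use re.sub, and that a leading ^ anchors the match to the start of the string, so drop it:

import re
re.sub(r"[ \t\n\r\f\v]", "", HTML_CONTENT)

Or

re.sub(r"\s", "", HTML_CONTENT)

Both remove every whitespace character, including the spaces inside tags.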

\v Matches a vertical tab \u000B.

\f Matches a form feed \u000C.
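
A small check, assuming Python 3's `re` module, of what `\s` covers and of how a leading `^` changes the result. The variable names here are only for illustration.

```python
import re

# Each character in the explicit class is also matched by \s.
explicit = " \t\n\r\f\v"
covered = all(re.fullmatch(r"\s", ch) for ch in explicit)
print(covered)  # True

sample = " a b\n"
# A leading ^ anchors the match to the start of the string,
# so only the first whitespace character is removed.
anchored = re.sub(r"^\s", "", sample)
# Without the anchor, every whitespace character is removed.
unanchored = re.sub(r"\s", "", sample)
print(repr(anchored))    # 'a b\n'
print(repr(unanchored))  # 'ab'
```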


I misunderstood the question at first, but here is how you would do it in Python:

import re
HTML_CONTENT = """\
<img id="headingpic"/> abcdef
qwerty..??,ksjhe173((:$
<div id="myContent">
"""

# Delete everything between a ">" and the next "<"
print(re.sub(r">[^<]*<", "><", HTML_CONTENT))

Outputs:

<img id="headingpic"/><div id="myContent">

Or, if you just want to remove white space and newlines:

import re
HTML_CONTENT = """\
<img id="headingpic"/>

<div id="myContent">
"""

# Delete only whitespace (spaces, tabs, line breaks) between a ">" and the next "<"
print(re.sub(r">\s*<", "><", HTML_CONTENT))

Outputs:

<img id="headingpic"/><div id="myContent">
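
To reuse the whitespace-only substitution, it could be wrapped in a small function. The sketch below assumes Python 3; `collapse_tag_gaps` is a hypothetical name, not from the answer, and the three sample strings mirror the patterns the question says IE produces.

```python
import re

# Whitespace (including line breaks) sitting directly between two tags.
# \s+ rather than \s* so that tags already touching are not re-substituted.
_BETWEEN_TAGS = re.compile(r">\s+<")


def collapse_tag_gaps(html):
    """Remove whitespace between a closing ">" and the next "<"."""
    return _BETWEEN_TAGS.sub("><", html)


variants = [
    '<img id="headingpic"/>\n <div id="myContent">',  # "\n "
    '<img id="headingpic"/> \n<div id="myContent">',  # " \n"
    '<img id="headingpic"/>\n<div id="myContent">',   # "\n"
]
for v in variants:
    print(collapse_tag_gaps(v))
# Each line printed: <img id="headingpic"/><div id="myContent">
```

Text inside an element is left alone because the pattern only matches whitespace directly between a ">" and a "<". That also means a space between two inline elements, such as "</b> <i>", would be removed, so this is only a sketch for block-level layouts like the one in the question.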