Regular expression to remove endline space patterns
I have a website updater that converts each p element to a textarea. The user types in the content, then each textarea is converted back to a p, and I grab the resulting HTML and store it in my SQL database.
My problem: in Internet Explorer, when I grab the HTML back, it has been changed slightly. For example:
// From this originally
<img id="headingpic"/><div id="myContent">
// To this
<img id="headingpic"/>
<div id="myContent">
This matters because now on display there is a vertical gap between the img & the div below.
Sometimes IE inserts an "\n ", sometimes it's an " \n", and sometimes it's just an "\n". I am trying to come up with a regular expression to remove these line endings (and the surrounding spacing) no matter their pattern. I have a lot of difficulty coming up with regular expressions; they seem so cryptic to me.
If I explain my algorithm, can you suggest the regular expression "characters" that achieve this?
- For every ">" character: ignoring any whitespace or end-of-line characters, if the next character is a "<", then proceed
- Delete every character between that ">" and the "<" (or replace it with "")
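For reference, the two steps above can be written out literally, without any regular expression. This is only a rough sketch of the algorithm as described; the function name remove_gap_between_tags is made up for illustration.

```python
def remove_gap_between_tags(html):
    """Drop whitespace that sits directly between a '>' and the next '<'."""
    out = []
    i = 0
    n = len(html)
    while i < n:
        ch = html[i]
        out.append(ch)
        i += 1
        if ch == ">":
            # Step 1: look past any whitespace or end-of-line characters.
            j = i
            while j < n and html[j].isspace():
                j += 1
            # Step 2: if the next real character is '<', skip the gap.
            if j < n and html[j] == "<":
                i = j
    return "".join(out)


print(remove_gap_between_tags('<img id="headingpic"/>\n <div id="myContent">'))
# <img id="headingpic"/><div id="myContent">
```

Whitespace that is followed by ordinary text rather than by a "<" is left alone.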
I am trying to do this in either JavaScript or Python:
# Python: should I use replace for this? Would my regular expression look something like this?
HTML_CONTENT.replace( "^[ \t\n\r]" ) # this removes all whitespace as far as I know
I would go about this a different way:
First, split by line:
var html_content_list = HTML_CONTENT.split("\n"); // split into lines
Then remove the whitespace at both ends of each line with .trim()
(assuming we are talking about strings and one line each; test for null first):
for (var i = 0; i < html_content_list.length; i++)
{
    html_content_list[i] = html_content_list[i].trim();
}
Then, if it really does need a new line, add it back when joining, and keep the result:
HTML_CONTENT = html_content_list.join("\n");
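Since the question also mentions Python, the same split, trim and join idea might look roughly like this there. The helper name trim_lines and its joiner parameter are assumptions for illustration, not part of the original code.

```python
def trim_lines(html, joiner="\n"):
    """Strip whitespace from both ends of every line, then rejoin the lines."""
    return joiner.join(line.strip() for line in html.splitlines())


html = '<img id="headingpic"/> \n <div id="myContent">'

# Keeping the newline leaves a clean "\n" between the two tags.
print(repr(trim_lines(html)))
# '<img id="headingpic"/>\n<div id="myContent">'

# Joining with an empty string removes the gap entirely.
print(repr(trim_lines(html, joiner="")))
# '<img id="headingpic"/><div id="myContent">'
```

Joining with "" also glues together lines of ordinary text, so it is probably only safe when every line break falls between two tags.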
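Your character class needs a few more characters, or you can use \s. Note that replace() also needs a replacement argument, and a string pattern is matched literally, so use a regex. In JavaScript, to strip the whitespace between a ">" and the following "<":
HTML_CONTENT = HTML_CONTENT.replace(/>[ \t\n\r\f\v]+</g, "><");
Or
HTML_CONTENT = HTML_CONTENT.replace(/>\s+</g, "><");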
\v Matches a vertical tab \u000B.
\f Matches a form feed \u000C.
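A quick way to check that the explicit class and \s cover the same six characters, sketched here in Python's re module (the variable names are just for illustration):

```python
import re

WHITESPACE = " \t\n\r\f\v"

explicit = re.compile(r"[ \t\n\r\f\v]")  # every character spelled out
shorthand = re.compile(r"\s")            # the shorthand class

# Both patterns should match each of the six characters on its own.
for ch in WHITESPACE:
    assert explicit.fullmatch(ch) and shorthand.fullmatch(ch)

print(repr(shorthand.sub("", "a \t\n\r\f\vb")))
# 'ab'
```

For Python 3 strings, \s additionally matches other Unicode whitespace, so it is a superset of the explicit class.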
I misunderstood the question at first, but here is how you would do it in Python:
import re
HTML_CONTENT = """\
<img id="headingpic"/> abcdef
qwerty..??,ksjhe173((:$
<div id="myContent">
"""
# Removes everything (text included) between a ">" and the next "<"
print(re.sub(r">[^<]*<", "><", HTML_CONTENT))
Outputs:
<img id="headingpic"/><div id="myContent">
Or, if you just want to remove white space and newlines:
import re
HTML_CONTENT = """\
<img id="headingpic"/>
<div id="myContent">
"""
# Removes only whitespace between a ">" and the next "<"
print(re.sub(r">\s*<", "><", HTML_CONTENT))
Outputs:
<img id="headingpic"/><div id="myContent">
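To see how the whitespace-only pattern behaves on the three variations mentioned in the question ("\n ", " \n" and "\n"), here is a small sketch. The names INTER_TAG_WS and close_tag_gaps are made up for this example.

```python
import re

# Whitespace (including newlines) sitting between a ">" and the next "<".
INTER_TAG_WS = re.compile(r">\s+<")


def close_tag_gaps(html):
    """Remove whitespace-only gaps between adjacent tags."""
    return INTER_TAG_WS.sub("><", html)


samples = [
    '<img id="headingpic"/>\n <div id="myContent">',
    '<img id="headingpic"/> \n<div id="myContent">',
    '<img id="headingpic"/>\n<div id="myContent">',
]
for sample in samples:
    print(close_tag_gaps(sample))
# each line prints: <img id="headingpic"/><div id="myContent">

# The broader pattern deletes real text as well; the whitespace-only one does not.
print(re.sub(r">[^<]*<", "><", "<p>Hello</p>"))  # <p></p>
print(close_tag_gaps("<p>Hello</p>"))            # <p>Hello</p>
```

One thing to watch: this also removes the space in something like `<b>a</b> <i>b</i>`, which changes how inline elements render, so it is safest when the gaps are only between block-level tags.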