Removing spaces and newlines between tags in HTML (aka unformatting) in Python

An example:

<p> Hello</p>
<div>hgello</div>
<pre>
   code
    code
</pre>

turns into something like:

<p> Hello</p><div>hgello</div><pre>
    code
     code
</pre>

How can I do this in Python? I also make heavy use of <pre> tags, so replacing every '\n' with '' is not an option.

What's the best way to do that?


You could use re.sub(r">\s*<", "><", html), where html is your HTML string (this needs import re).
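For illustration, here is a minimal sketch of that call applied to the sample from the question. The variable names html and compact are mine, and I am assuming the last tag in the sample is meant to be a closing </pre>.

```python
import re

# Sample from the question (assumption: the final tag is a closing </pre>).
html = "<p> Hello</p>\n<div>hgello</div>\n<pre>\n   code\n    code\n</pre>\n"

# Replace any run of whitespace that sits directly between '>' and '<'.
# The text inside <pre> is not matched here because it contains
# non-whitespace characters between the two tags.
compact = re.sub(r">\s*<", "><", html)

print(compact)
```

The newlines between </p>, <div> and <pre> disappear, while the indented lines inside the <pre> block stay as they were.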

Maybe string.replace(">\n", ">"), i.e. look for a closing bracket followed by a newline and remove the newline.
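A small sketch of what the plain-string approach does, using a made-up indented snippet (html and joined are hypothetical names). It only removes a newline that comes right after '>', so any indentation before the next tag is left in place.

```python
# Hypothetical input with an indented child tag.
html = "<div>\n  <p>Hi</p>\n</div>\n"

# str.replace works on literal text only: it removes '\n' after '>'
# but knows nothing about the spaces that follow the newline.
joined = html.replace(">\n", ">")

print(repr(joined))
```

So this is fine when every tag starts at the beginning of a line, but for indented markup a regex on whitespace is probably the better fit.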


I would use Python's re module (str.replace only replaces literal text and does not understand regex patterns):

re.sub(r">\s+<", "><", string)

Here '\s' matches any whitespace character and the '+' after it means one or more of them. Because the pattern only matches whitespace that sits directly between '>' and '<', it cannot replace

<pre>
    code
     code
</pre>

with

<pre></pre>

More information about regular expressions can be found here, here and here.
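One caveat: if a <pre> block itself contains tags separated by whitespace, a bare ">\s+<" substitution will collapse those too. Below is a rough sketch of one way around that, not a definitive implementation. The names PATTERN and unformat are hypothetical, and it assumes <pre> blocks are not nested and that a regex is good enough for your markup (it is not a real HTML parser).

```python
import re

# Either a whole <pre>...</pre> block (captured in group 1), or a run of
# whitespace that has '>' right before it and '<' right after it.
PATTERN = re.compile(
    r"(<pre\b.*?</pre>)|(?<=>)\s+(?=<)",
    re.DOTALL | re.IGNORECASE,
)

def unformat(html):
    """Remove whitespace between tags, leaving <pre> blocks untouched."""
    # If group 1 matched, put the <pre> block back unchanged;
    # otherwise the match is inter-tag whitespace, so drop it.
    return PATTERN.sub(lambda m: m.group(1) or "", html)

sample = (
    "<p> Hello</p>\n"
    "<div>hgello</div>\n"
    "<pre>\n  <b>x</b>\n  <i>y</i>\n</pre>\n"
    "<p>end</p>\n"
)

safe = unformat(sample)
naive = re.sub(r">\s+<", "><", sample)

print(safe)
print(naive)
```

With this sample, unformat keeps the line breaks and indentation inside the <pre> block, while the plain substitution squeezes the <b> and <i> lines together. The lookbehind and lookahead are used so that the '>' and '<' themselves are not consumed, which lets the whitespace right before and right after a <pre> block still be removed.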
