
How to double the width and height using a Python regular expression

I have a lot of tags to change, like this:

<img src = "/" height="111" width="10" />

to

<img src = "/" height="222" width="20" />

So I want to use a Python regular expression. This is my code:

import re

s = '<img src = "werwerwe" height="111" width="10" />'

def a(x):
    print x.group(2)
    print x.group(4)

ss = re.sub(r'''<img.*(width\s*="?(\d+)"?)*\s*(height\s*="?(\d+)"?)*''',a, s)

print ss

What can I do?

Thanks.

Update:

It works now:

import re

s = '<img src = "/" height="111" width="10" />'


def a(x):
    b = x.group(0)
    b = b.replace(x.group(1),str(int(x.group(1))*2))
    b = b.replace(x.group(2),str(int(x.group(2))*2))
    return b

ss = re.sub(r'''<img.*?height=\"(\d+)\".*?width=\"(\d+)\"[^>]*>''',a, s)

print ss


Don't use regular expressions to parse HTML. Use BeautifulSoup

>>> from BeautifulSoup import BeautifulSoup
>>> ht = '<html><head><title>foo</title></head><body><p>whatever: <img src="foo/img.png" height="111" width="22" /></p><ul><li><img src="foo/img2.png" height="32" width="44" /></li></ul></body></html>'
>>> soup = BeautifulSoup(ht)
>>> soup
<html><head><title>foo</title></head><body><p>whatever: <img src="foo/img.png" height="111" width="22" /></p><ul><li><img src="foo/img2.png" height="32" width="44" /></li></ul></body></html>
>>> soup.findAll('img')
[<img src="foo/img.png" height="111" width="22" />, <img src="foo/img2.png" height="32" width="44" />]
>>> for img in soup.findAll('img'):
...     ht = int(img['height'])
...     wi = int(img['width'])
...     img['height'] = str(ht * 2)
...     img['width'] = str(wi * 2)
...     
... 
>>> print soup.prettify()
<html>
 <head>
  <title>
   foo
  </title>
 </head>
 <body>
  <p>
   whatever:
   <img src="foo/img.png" height="222" width="44" />
  </p>
  <ul>
   <li>
    <img src="foo/img2.png" height="64" width="88" />
   </li>
  </ul>
 </body>
 </html>
>>> 


Don't use regular expressions when dealing with HTML. Parse it properly with something like lxml.

import lxml.html

html = '<img src = "werwerwe" height="111" width="10" />'

etree = lxml.html.fromstring(html)

images = etree.xpath('//img')
for image in images:
    h = int(image.attrib['height'])
    w = int(image.attrib['width'])
    image.attrib['height'] = str(h*2)
    image.attrib['width'] = str(w*2)

print lxml.html.tostring(etree)

Gives:

<img src="werwerwe" height="222" width="20">


Disclaimer: I agree that parsing HTML is best performed using an HTML parser. However, the poster has specifically asked for a regex solution, and this particular problem presents a good vehicle to demonstrate a clever (and little-known) regex technique that is quite handy.

But first, there is a logic error in the original function. It blindly performs its numerical replacements, which gives wrong results when the WIDTH is exactly half the HEIGHT. For example, given the following:

<img src = "/" width="10" height="20" />

The original posted program returns the following erroneous result:

<img src = "/" width="40" height="40" />

The problem is that WIDTH gets doubled twice. Additional logic is needed to guarantee correct replacement.
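
To make the pitfall concrete, here is a minimal sketch that reuses the callback and pattern from the question's update. Note that this pattern expects height before width, so the input below is my own choice of values, picked so that the first doubled value equals the second value:

```python
import re

def double_by_replace(match):
    # Same approach as the callback in the question: plain text replacement
    # on the whole matched tag.
    tag = match.group(0)
    tag = tag.replace(match.group(1), str(int(match.group(1)) * 2))
    tag = tag.replace(match.group(2), str(int(match.group(2)) * 2))
    return tag

pattern = r'<img.*?height="(\d+)".*?width="(\d+)"[^>]*>'

# Values that do not collide: works as intended.
ok = re.sub(pattern, double_by_replace,
            '<img src = "/" height="111" width="10" />')

# height="10" becomes "20", then every "20" (both attributes) becomes "40".
bad = re.sub(pattern, double_by_replace,
             '<img src = "/" height="10" width="20" />')

print(ok)
print(bad)
```

The second call yields height="40" width="40" because str.replace has no idea which attribute a number belongs to.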

A cool regex trick you may not know:

Here is a modified version of the original program which fixes the above mentioned error and includes a (commented) version of an improved regex:

import re
s = '<img src = "/" width="10" height="111"  />'

def a(x):
    b = x.group(0)
    if x.group(1):
        b = b.replace(x.group(1),
            "width=\""+ str(int(x.group(2))*2) +"\"")
    if x.group(3):
        # Group 3 is the HEIGHT attribute, so rebuild it as height="...".
        b = b.replace(x.group(3),
            "height=\""+ str(int(x.group(4))*2) +"\"")
    return b

reobj = re.compile(r'''
    <img                        # Start of IMG tag.
    (?:                         # Group for multiple attributes.
      \s+                       # Attributes separated by whitespace.
      (?:                       # Group for attribute alternatives.
        (width\s*=\s*"(\d+)")   # $1: WIDTH attribute, $2 value.
      | (height\s*=\s*"(\d+)")  # $3: HEIGHT attribute, $4 value.
      |[^\s>]+)                 # Other IMG attributes.
    )+                          # One or more attributes.
    [^>]*>                      # End of IMG tag.
    ''', re.IGNORECASE | re.VERBOSE)

ss = re.sub(reobj, a, s)

print ss

Note that the WIDTH gets captured into groups $1 and $2 and HEIGHT into groups $3 and $4, even if their order is reversed in the target string. I wish I could say that I thought up this cool trick, but I didn't. I stole it from one of Steven Levithan's excellent blog posts: Capturing Multiple, Optional HTML Attribute Values. Pretty nifty, eh?
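
A quick way to check the order-independence claim is to run the same verbose pattern against both attribute orders. This is only a sketch: the helper name dimensions is hypothetical, and it relies on Python's re module keeping the last value a group captured inside a repeated group.

```python
import re

IMG_ATTRS = re.compile(r'''
    <img
    (?:
      \s+
      (?:
        (width\s*=\s*"(\d+)")
      | (height\s*=\s*"(\d+)")
      | [^\s>]+
      )
    )+
    [^>]*>
    ''', re.IGNORECASE | re.VERBOSE)

def dimensions(tag):
    # Hypothetical helper: return (width, height) as strings,
    # whichever attribute comes first in the tag.
    m = IMG_ATTRS.search(tag)
    return m.group(2), m.group(4)

print(dimensions('<img src="/" width="10" height="111" />'))
print(dimensions('<img src="/" height="111" width="10" />'))
```

Both calls report the width from group 2 and the height from group 4.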

A cleaner regex solution

Clever as that may be, it is still more complex than it needs to be for this job. I would keep it simple and just do two separate replace operations like so:

import re
s = '<img src = "/" width="10" height="111"  />'

def a(x):
    return x.group(1) + str(int(x.group(2))*2)

ss = re.sub(r"(?i)(<img[^>]*?width\s*=\s*[\"'])(\d+)",a, s)
ss = re.sub(r"(?i)(<img[^>]*?height\s*=\s*[\"'])(\d+)",a, ss)

print ss

Smaller, cleaner, easier to read, and probably the fastest solution. (Note that the callback function becomes trivial.)
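
If you would rather make a single pass, one possible variation (not part of the code above, just a sketch) is to nest two substitutions: an outer one that finds each IMG tag and an inner one that doubles the attributes inside it. The names ATTR, IMG and double_dimensions are made up for this example, and like any regex approach it assumes no ">" appears inside an attribute value.

```python
import re

# Inner pattern: a width/height attribute name, "=", an opening quote, digits.
# The lookbehind skips names such as data-width.
ATTR = re.compile(r'''(?i)(?<![\w-])((?:width|height)\s*=\s*["'])(\d+)''')
# Outer pattern: one whole IMG tag.
IMG = re.compile(r'(?i)<img\b[^>]*>')

def double_attr(m):
    # Keep the name/quote prefix, double the number.
    return m.group(1) + str(int(m.group(2)) * 2)

def double_img(m):
    # Only touch attributes inside the matched IMG tag.
    return ATTR.sub(double_attr, m.group(0))

def double_dimensions(html):
    return IMG.sub(double_img, html)

print(double_dimensions(
    '<p width="5"><img src = "/" height="111" width="10" /></p>'))
```

Attributes on other tags (the width on the P element here) are left alone because the inner substitution only ever sees the text of an IMG tag.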


Nothing good will come from attempting to use regex to parse HTML. No matter what you do, it will eventually break.

So, use an HTML parser like Python's HTMLParser. It will decode all of the HTML text, and you just need to print it back out with your changes.

On another note, modifying HTML the way you are doing looks suspicious. You are probably doing something the very hard way.
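
For what it's worth, here is roughly what that could look like with the standard library parser. This is a sketch written for Python 3, where the class lives in html.parser (in Python 2 the module was named HTMLParser). The class name DimensionDoubler and the helper double_dimensions are my own, the output normalizes quoting and spacing rather than preserving the input byte for byte, and processing instructions are not handled.

```python
from html import escape
from html.parser import HTMLParser

class DimensionDoubler(HTMLParser):
    """Re-emit the document, doubling width/height on img tags."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_tag(self, tag, attrs, closing):
        parts = [tag]
        for name, value in attrs:
            if (tag == "img" and name in ("width", "height")
                    and value and value.isdigit()):
                value = str(int(value) * 2)
            if value is None:
                parts.append(name)  # attribute without a value
            else:
                # The parser unescapes attribute values, so escape them again.
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + closing)

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

def double_dimensions(html):
    parser = DimensionDoubler()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

print(double_dimensions(
    '<p>a &amp; b <img src="/" height="111" width="10" /></p>'))
```

Most of the class is just plumbing to write the document back out; the only real logic is the width/height check in _emit_tag.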


Once again, this task is best solved with an HTML parser, as suggested here and here.


If you still want to use a regular expression for this, you can use this one instead:

<img.*?(width|height)=\"(\d+)\".*?(width|height)=\"(\d+)\"

For example:

In text: <img src = "/" width="10" height="111"/> will match the following groups:

  • Group 1: "width"
  • Group 2: "10"
  • Group 3: "height"
  • Group 4: "111"

In text: <img src = "/" height="111" width="10"/> it will match:

  • Group 1: "height"
  • Group 2: "111"
  • Group 3: "width"
  • Group 4: "10"

Now it matches whether width comes before height or vice versa, and I think the four groups give you enough information to do the replacement.

Edit:
I captured the attribute names so that you know which value matched first (otherwise, if you get 111 and 10, you won't know which one is the height and which is the width). I don't think that's necessary in your case, because all you have to do is double both values, but it could be useful if you want to scale height and width by different amounts.
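
As a rough illustration of how the four groups could drive the replacement, the sketch below scales each value by a factor looked up from the captured attribute name. The FACTORS table and the function name scale are assumptions for the example only. It edits by match position rather than with str.replace, so equal numbers cannot be confused with each other.

```python
import re

PAIR = re.compile(r'<img.*?(width|height)="(\d+)".*?(width|height)="(\d+)"')

# Hypothetical scale factors, chosen only for illustration.
FACTORS = {"width": 2, "height": 3}

def scale(m):
    text = m.group(0)
    offset = m.start(0)
    # Work from right to left so the earlier positions stay valid
    # even when a number gains a digit.
    for name_group, value_group in ((3, 4), (1, 2)):
        factor = FACTORS[m.group(name_group)]
        new_value = str(int(m.group(value_group)) * factor)
        start = m.start(value_group) - offset
        end = m.end(value_group) - offset
        text = text[:start] + new_value + text[end:]
    return text

print(PAIR.sub(scale, '<img src = "/" height="111" width="10"/>'))
print(PAIR.sub(scale, '<img src = "/" width="10" height="111"/>'))
```

With both factors set to 2 this reduces to the doubling asked for in the question.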


Try with the following regex:

<img.*?height=\"(\d+)\".*?width=\"(\d+)\"

Group 1 will capture the height and Group 2 the width
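
To actually rewrite the tag with a pattern like this, one option is to also capture the text around the numbers and rebuild the match from the pieces. Note that this changes the group numbering from the pattern above (here group 2 is the height and group 4 is the width), and the names PATTERN and double_both are just for this sketch.

```python
import re

# Groups 1 and 3 hold the text before each number; groups 2 and 4 the numbers.
PATTERN = re.compile(r'(<img.*?height=")(\d+)(".*?width=")(\d+)')

def double_both(m):
    # Reassemble the match piece by piece, so each number is doubled once.
    return (m.group(1) + str(int(m.group(2)) * 2) +
            m.group(3) + str(int(m.group(4)) * 2))

result = PATTERN.sub(double_both, '<img src = "/" height="10" width="20" />')
print(result)
```

Like the pattern it is based on, this only matches when height comes before width.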
