how to make the width and height x2 using python Regular
i have to do many work to change like this:
<img src = "/" height="111" width="10" />
to
<img src = "/" height="222" width="20" />
so i want to use python Regular this is my code :
import re
s = '<img src = "werwerwe" height="111" width="10" />'
def a(x):
print x.group(2)
print x.group(4)
ss = re.sub(r'''<img.*(width\s*="?(\d+)"?)*\s*(height\s*="?(\d+)"?)*''',a, s)
print ss
so what can i do ,
thanks
updated:
it is ok now :
import re
s = '<img src = "/" height="1开发者_如何学Python11" width="10" />'
def a(x):
b = x.group(0)
b = b.replace(x.group(1),str(int(x.group(1))*2))
b = b.replace(x.group(2),str(int(x.group(2))*2))
return b
ss = re.sub(r'''<img.*?height=\"(\d+)\".*?width=\"(\d+)\"[^>]*>''',a, s)
print ss
Don't use regular expressions to parse HTML. Use BeautifulSoup
>>> from BeautifulSoup import BeautifulSoup
>>> ht = '<html><head><title>foo</title></head><body><p>whatever: <img src="foo/img.png" height="111" width="22" /></p><ul><li><img src="foo/img2.png" height="32" width="44" /></li></ul></body></html>'
>>> soup = BeautifulSoup(ht)
>>> soup
<html><head><title>foo</title></head><body><p>whatever: <img src="foo/img.png" height="111" width="22" /></p><ul><li><img src="foo/img2.png" height="32" width="44" /></li></ul></body></html>
>>> soup.findAll('img')
[<img src="foo/img.png" height="111" width="22" />, <img src="foo/img2.png" height="32" width="44" />]
>>> for img in soup.findAll('img'):
... ht = int(img['height'])
... wi = int(img['width'])
... img['height'] = str(ht * 2)
... img['width'] = str(wi * 2)
...
...
>>> print soup.prettify()
<html>
<head>
<title>
foo
</title>
</head>
<body>
<p>
whatever:
<img src="foo/img.png" height="222" width="44" />
</p>
<ul>
<li>
<img src="foo/img2.png" height="64" width="88" />
</li>
</ul>
</body>
</html>
>>>
Don't use regular expressions when dealing with HTML. Parse it properly with something like lxml.
import lxml.html
html = '<img src = "werwerwe" height="111" width="10" />'
etree = lxml.html.fromstring(html)
images = etree.xpath('//img')
for image in images:
h = int(image.attrib['height'])
w = int(image.attrib['width'])
image.attrib['height'] = str(h*2)
image.attrib['width'] = str(w*2)
print lxml.html.tostring(etree)
Gives:
<img src="werwerwe" height="222" width="20">
Disclaimer: I agree that parsing HTML is best performed using an HTML parser. However, the poster has specifically asked for a regex solution, and this particular problem presents a good vehicle to demonstrate a clever (and little-known) regex technique that is quite handy.
But first, there is a logic error in the original function. It blindly performs its numerical replacement which results in erroneous results when the WIDTH is exactly half the HEIGHT, e.g. Given the following:
<img src = "/" width="10" height="20" />
The original posted program returns the following erroneous result:
<img src = "/" width="40" height="40" />
The problem is that WIDTH gets doubled twice. Additional logic is needed to guarantee correct replacement.
A cool regex trick you may not know:
Here is a modified version of the original program which fixes the above mentioned error and includes a (commented) version of an improved regex:
import re
s = '<img src = "/" width="10" height="111" />'
def a(x):
b = x.group(0)
if x.group(1):
b = b.replace(x.group(1),
"width=\""+ str(int(x.group(2))*2) +"\"")
if x.group(3):
b = b.replace(x.group(3),
"width=\""+ str(int(x.group(4))*2) +"\"")
return b
reobj = re.compile(r'''
<img # Start of IMG tag.
(?: # Group for multiple attributes.
\s+ # Attributes separated by whitespace.
(?: # Group for attribute alternatives.
(width\s*=\s*"(\d+)") # $1: WIDTH attribute, $2 value.
| (height\s*=\s*"(\d+)") # $3: HEIGHT attribute, $4 value.
|[^\s>]+) # Other IMG attributes.
)+ # One or more attributes.
[^>]*> # End of IMG tag.
''', re.IGNORECASE | re.VERBOSE)
ss = re.sub(reobj, a, s)
print ss
Note that the WIDTH gets captured into groups $1 and $2 and HEIGHT into groups $3 and $4, even if their order is reversed in the target string. I wish I could say that I thought up this cool trick, but I didn't. I stole it from one of Steven Leveithan's excellent blog posts: Capturing Multiple, Optional HTML Attribute Values. Pretty nifty eh?
A cleaner regex solution
Clever as that may be, it is still more complex than it needs to be for this job. I would keep it simple and just do two separate replace operations like so:
import re
s = '<img src = "/" width="10" height="111" />'
def a(x):
return x.group(1) + str(int(x.group(2))*2)
ss = re.sub(r"(?i)(<img[^>]*?width\s*=\s*[\"'])(\d+)",a, s)
ss = re.sub(r"(?i)(<img[^>]*?height\s*=\s*[\"'])(\d+)",a, ss)
print ss
Smaller. cleaner, easier to read and probably the fastest solution. (Note that the callback function becomes trivial.)
Nothing good will come from attempting to use regex to parse HTML. No matter what you do, it will eventually break.
So, use an html parser like python's HTMLParser, it will decode all of the HTML text and you just need to print it back out with your changes.
On another note, modifying html like you are doing looks suspicious. You are probably doing something the very hard way.
Once again, that task should be solved perfectly by a HTML Parser like suggested here and here.
If you still want to use a Regular Expressions for that purpose, you can use this one instead:
<img.*?(width|height)=\"(\d+)\".*?(width|height)=\"(\d+)\"
For example:
In text: <img src = "/" width="10" height="111"/>
will match the following groups:
- Group 1:
"width"
- Group 2:
"10"
- Group 3:
"height"
- Group 4:
"111"
In text: <img src = "/" height="111" width="10"/>
it will match:
- Group 1:
"height"
- Group 2:
"111"
- Group 3:
"width"
- Group 4:
"10"
Now it matches no matter if width
is before height
or viceversa, and I think the 4 groups give you enough info when doing the replacement.
Edit:
I captured the groups height
and width
for you to know which value matched first (otherwise, if you obtain 111
and 10
you won't know which one is the height
and the width
), but I don't think that's necessary in your case because all you have to do is duplicate both values, but could be useful in case you want increment height
and width
in different values.
Try with the following regex:
<img.*?height=\"(\d+)\".*?width=\"(\d+)\"
Group 1
will capture the height and Group 2
the width
精彩评论