how to make the width and height x2 using python Regular

2023-03-02 20:01 问答作者：

i have to do many work to change like this:

<img src = "/" height="111" width="10" />

<img src = "/" height="222" width="20" />

so i want to use python Regular this is my code :

import re

s = '<img src = "werwerwe" height="111" width="10" />'

def a(x):
    print x.group(2)
    print x.group(4)

ss = re.sub(r'''<img.*(width\s*="?(\d+)"?)*\s*(height\s*="?(\d+)"?)*''',a, s)

print ss

so what can i do ,

thanks

updated:

it is ok now :

import re

s = '<img src = "/" height="1开发者_如何学Python11" width="10" />'


def a(x):
    b = x.group(0)
    b = b.replace(x.group(1),str(int(x.group(1))*2))
    b = b.replace(x.group(2),str(int(x.group(2))*2))
    return b

ss = re.sub(r'''<img.*?height=\"(\d+)\".*?width=\"(\d+)\"[^>]*>''',a, s)

print ss

Don't use regular expressions to parse HTML. Use BeautifulSoup

>>> from BeautifulSoup import BeautifulSoup
>>> ht = '<html><head><title>foo</title></head><body><p>whatever: <img src="foo/img.png" height="111" width="22" /></p><ul><li><img src="foo/img2.png" height="32" width="44" /></li></ul></body></html>'
>>> soup = BeautifulSoup(ht)
>>> soup
<html><head><title>foo</title></head><body><p>whatever: <img src="foo/img.png" height="111" width="22" /></p><ul><li><img src="foo/img2.png" height="32" width="44" /></li></ul></body></html>
>>> soup.findAll('img')
[<img src="foo/img.png" height="111" width="22" />, <img src="foo/img2.png" height="32" width="44" />]
>>> for img in soup.findAll('img'):
...     ht = int(img['height'])
...     wi = int(img['width'])
...     img['height'] = str(ht * 2)
...     img['width'] = str(wi * 2)
...     
... 
>>> print soup.prettify()
<html>
 <head>
  <title>
   foo
  </title>
 </head>
 <body>
  <p>
   whatever:
   <img src="foo/img.png" height="222" width="44" />
  </p>
  <ul>
   <li>
    <img src="foo/img2.png" height="64" width="88" />
   </li>
  </ul>
 </body>
 </html>
>>>

Don't use regular expressions when dealing with HTML. Parse it properly with something like lxml.

import lxml.html

html = '<img src = "werwerwe" height="111" width="10" />'

etree = lxml.html.fromstring(html)

images = etree.xpath('//img')
for image in images:
    h = int(image.attrib['height'])
    w = int(image.attrib['width'])
    image.attrib['height'] = str(h*2)
    image.attrib['width'] = str(w*2)

print lxml.html.tostring(etree)

Gives:

<img src="werwerwe" height="222" width="20">

Disclaimer: I agree that parsing HTML is best performed using an HTML parser. However, the poster has specifically asked for a regex solution, and this particular problem presents a good vehicle to demonstrate a clever (and little-known) regex technique that is quite handy.

But first, there is a logic error in the original function. It blindly performs its numerical replacement which results in erroneous results when the WIDTH is exactly half the HEIGHT, e.g. Given the following:

<img src = "/" width="10" height="20" />

The original posted program returns the following erroneous result:

<img src = "/" width="40" height="40" />

The problem is that WIDTH gets doubled twice. Additional logic is needed to guarantee correct replacement.

A cool regex trick you may not know:

Here is a modified version of the original program which fixes the above mentioned error and includes a (commented) version of an improved regex:

import re
s = '<img src = "/" width="10" height="111"  />'

def a(x):
    b = x.group(0)
    if x.group(1):
        b = b.replace(x.group(1),
            "width=\""+ str(int(x.group(2))*2) +"\"")
    if x.group(3):
        b = b.replace(x.group(3),
            "width=\""+ str(int(x.group(4))*2) +"\"")
    return b

reobj = re.compile(r'''
    <img                        # Start of IMG tag.
    (?:                         # Group for multiple attributes.
      \s+                       # Attributes separated by whitespace.
      (?:                       # Group for attribute alternatives.
        (width\s*=\s*"(\d+)")   # $1: WIDTH attribute, $2 value.
      | (height\s*=\s*"(\d+)")  # $3: HEIGHT attribute, $4 value.
      |[^\s>]+)                 # Other IMG attributes.
    )+                          # One or more attributes.
    [^>]*>                      # End of IMG tag.
    ''', re.IGNORECASE | re.VERBOSE)

ss = re.sub(reobj, a, s)

print ss

Note that the WIDTH gets captured into groups $1 and $2 and HEIGHT into groups $3 and $4, even if their order is reversed in the target string. I wish I could say that I thought up this cool trick, but I didn't. I stole it from one of Steven Leveithan's excellent blog posts: Capturing Multiple, Optional HTML Attribute Values. Pretty nifty eh?

A cleaner regex solution

Clever as that may be, it is still more complex than it needs to be for this job. I would keep it simple and just do two separate replace operations like so:

import re
s = '<img src = "/" width="10" height="111"  />'

def a(x):
    return x.group(1) + str(int(x.group(2))*2)

ss = re.sub(r"(?i)(<img[^>]*?width\s*=\s*[\"'])(\d+)",a, s)
ss = re.sub(r"(?i)(<img[^>]*?height\s*=\s*[\"'])(\d+)",a, ss)

print ss

Smaller. cleaner, easier to read and probably the fastest solution. (Note that the callback function becomes trivial.)

Nothing good will come from attempting to use regex to parse HTML. No matter what you do, it will eventually break.

So, use an html parser like python's HTMLParser, it will decode all of the HTML text and you just need to print it back out with your changes.

On another note, modifying html like you are doing looks suspicious. You are probably doing something the very hard way.

Once again, that task should be solved perfectly by a HTML Parser like suggested here and here.

If you still want to use a Regular Expressions for that purpose, you can use this one instead:

<img.*?(width|height)=\"(\d+)\".*?(width|height)=\"(\d+)\"

For example:

In text: <img src = "/" width="10" height="111"/> will match the following groups:

Group 1: "width"
Group 2: "10"
Group 3: "height"
Group 4: "111"

In text: <img src = "/" height="111" width="10"/> it will match:

Group 1: "height"
Group 2: "111"
Group 3: "width"
Group 4: "10"

Now it matches no matter if width is before height or viceversa, and I think the 4 groups give you enough info when doing the replacement.

Edit:
I captured the groups height and width for you to know which value matched first (otherwise, if you obtain 111 and 10 you won't know which one is the height and the width), but I don't think that's necessary in your case because all you have to do is duplicate both values, but could be useful in case you want increment height and width in different values.

Try with the following regex:

<img.*?height=\"(\d+)\".*?width=\"(\d+)\"

Group 1 will capture the height and Group 2 the width

继续阅读：python regex

how to make the width and height x2 using python Regular

A cool regex trick you may not know:

A cleaner regex solution

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

A cool regex trick you may not know:

A cleaner regex solution

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？