regex to extract part of filename
I want to extract part of a filename that is contained in an XML string.
Sample
<assets>
<media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf7.JPG" valign="top"/>
<media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf5.JPG" valign="top"/>
<media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf4.JPG" valign="top"/>
</assets>
I want to match and retrieve the 560PEgnR portion from all entries, regardless of the filename
So far I have
/assets/(.*)/*"
But it doesn't do what I want
Any help appreciated
Thanks
Alternatively...
/assets/([^/]+)/
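Where the + sits relative to the capturing parentheses matters for a pattern like this one. Here is a minimal sketch using Python's re module; the choice of Python and the variable names are assumptions, not part of the original answer.

```python
import re

line = '<media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf7.JPG" valign="top"/>'

# Quantifier inside the group: the group captures the whole path segment.
inside = re.search(r'/assets/([^/]+)/', line).group(1)

# Quantifier outside the group: the group is repeated and keeps only
# its last iteration, i.e. the final character of the segment.
outside = re.search(r'/assets/([^/])+/', line).group(1)

print(inside)   # 560PEgnR
print(outside)  # R
```

Both patterns match the same text overall; they differ only in what ends up in group 1.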
You should try with:
/assets/(.*?)/.*
.* is greedy, but adding ? makes it lazy, so it stops at the first /.
There are several alternatives. Your mistake is that the .* part also matches '/'. Either make it non-greedy (as hsz proposed above), or exclude '/' from the capturing group, like this:
/assets/([^/]*).*
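To see the difference between the greedy, lazy, and negated-class variants, here is a small sketch in Python that runs all three against the sample from the question. Python's re module and the variable names are assumptions made for illustration; the question does not say which regex engine is in use.

```python
import re

xml = """
<assets>
<media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf7.JPG" valign="top"/>
<media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf5.JPG" valign="top"/>
<media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf4.JPG" valign="top"/>
</assets>
"""

# Greedy: .* runs to the end of the line, then backtracks to the LAST '/'.
greedy = re.findall(r'/assets/(.*)/', xml)

# Lazy: .*? expands one character at a time and stops at the FIRST '/'.
lazy = re.findall(r'/assets/(.*?)/', xml)

# Negated class: [^/]* cannot cross a '/' at all.
negated = re.findall(r'/assets/([^/]*)', xml)

print(greedy[0])  # 560PEgnR/kVvNKfOX7w9tf7.JPG" valign="top"
print(lazy)       # ['560PEgnR', '560PEgnR', '560PEgnR']
print(negated)    # ['560PEgnR', '560PEgnR', '560PEgnR']
```

The greedy capture overshoots because the last '/' on each line is the one in the closing "/>" of the tag.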
A non-regex approach
>>> string="""
... <assets>
... <media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf7.JPG" valign="top"/>
... <media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf5.JPG" valign="top"/>
... <media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf4.JPG" valign="top"/>
... </assets>
... """
>>> for line in string.split("\n"):
...     if "/assets/" in line:
...         print(line.split("/assets/")[-1].split("/")[0])
...
560PEgnR
560PEgnR
560PEgnR
Properly parsing the XML and avoiding the unnecessary use of regex:
from lxml import etree
xml = """
<assets>
<media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf7.JPG" valign="top"/>
<media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf5.JPG" valign="top"/>
<media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf4.JPG" valign="top"/>
</assets>
"""
xmltree = etree.fromstring(xml)
# Visit every <media> element and take the directory just before the filename.
for media in xmltree.iterfind(".//media"):
    path = media.get('img')
    print(path.split('/')[-2])
Gives:
560PEgnR
560PEgnR
560PEgnR
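If lxml is not installed, the same idea should work with the standard library's xml.etree.ElementTree. This is a sketch under that assumption rather than part of the original answer; the names root and ids are made up for illustration.

```python
import xml.etree.ElementTree as ET

xml = """
<assets>
<media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf7.JPG" valign="top"/>
<media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf5.JPG" valign="top"/>
<media width="100%" height="100%" img="/assets/560PEgnR/kVvNKfOX7w9tf4.JPG" valign="top"/>
</assets>
"""

root = ET.fromstring(xml)

# For each <media> element, split the img path on '/' and take the
# second-to-last piece, which is the directory holding the file.
ids = [media.get("img").split("/")[-2] for media in root.iter("media")]

for asset_id in ids:
    print(asset_id)
```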