Python regex issue
I'm trying to extract ALL phone screen resolutions from the WURFL XML file with the Python script below. The problem is that I only get the first match. Why? How can I get all matches?
The WURFL XML file can be found at http://sourceforge.net/projects/wurfl/files/WURFL/latest/wurfl-latest.zip/download?use_mirror=freefr
    def read_file(file_name):
        f = open(file_name, 'rb')
        data = f.read()
        f.close()
        return data

    text = read_file('wurfl.xml')

    import re
    pattern = '<device id="(.*?)".*actual_device_root="true">.*<capability name="resolution_width" value="(\d+)"/>.*<capability name="resolution_height" value="(\d+)"/>.*</device>'
    for m in re.findall(pattern, text, re.DOTALL):
        print(m)
First, use an XML parser instead of regular expressions. You'll be happier in the long run.
Second, if you insist on using regexes, use finditer() instead of findall().
Third, your regex matches from the first entry to the last one (the .* is greedy, and you have set DOTALL mode), so either see the first paragraph or at least change your regex to

    pattern = r'<device id="(.*?)".*?actual_device_root="true">.*?<capability name="resolution_width" value="(\d+)"/>.*?<capability name="resolution_height" value="(\d+)"/>.*?</device>'
Also, always use raw strings with regexes. \d happens to work because Python leaves unrecognized escapes alone, but \b in a "normal" string is a backspace character, not a word boundary.
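The finditer() and raw-string points can be checked on a tiny input. This is only a sketch: the sample string and the variable names are made up for illustration and are not taken from wurfl.xml.

```python
import re

# Hypothetical sample; real WURFL entries carry many more attributes.
sample = '<device id="a"/> <device id="b"/>'

# finditer yields one match object per match instead of building a list up front.
ids = [m.group(1) for m in re.finditer(r'<device id="(.*?)"', sample)]
print(ids)

# In a normal string '\b' is a single backspace character (0x08);
# in a raw string it stays backslash + 'b', which the regex engine
# reads as a word boundary.
plain_hits = re.findall('\bcat\b', 'a cat sat')
raw_hits = re.findall(r'\bcat\b', 'a cat sat')
print(plain_hits, raw_hits)
```

The non-raw pattern silently matches nothing, which is why the raw-string habit is worth keeping even when a pattern like \d appears to work without it.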
This comes down to how findall treats capture groups: when the pattern contains groups, findall returns the groups rather than the whole match (a list of tuples when there is more than one group). See this question.
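For reference, a small check of what findall returns depending on how many capture groups the pattern has. The sample string is invented for the purpose; only the re calls are real.

```python
import re

# Made-up sample text, not WURFL data.
sample = 'width=100 height=200; width=300 height=400'

# No groups: findall returns the full matched substrings.
no_groups = re.findall(r'width=\d+', sample)

# One group: findall returns just that group for each match.
one_group = re.findall(r'width=(\d+)', sample)

# Several groups: findall returns one tuple of groups per match.
two_groups = re.findall(r'width=(\d+) height=(\d+)', sample)

print(no_groups)
print(one_group)
print(two_groups)
```

With three groups, as in the pattern from the question, each item printed by the loop is a 3-tuple, so a single printed tuple means the pattern matched once.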
You are using "greedy" matches: .* will match as much text as it can grab, which means the .* before the first <capability ...> tag matches most of the file. Make each one lazy with .*? instead:
    import re

    text = open('wurfl.xml').read()
    pattern = r'<device id="(.*?)".*?actual_device_root="true">.*?<capability name="resolution_width" value="(\d+)"/>.*?<capability name="resolution_height" value="(\d+)"/>.*?</device>'
    for m in re.findall(pattern, text, re.DOTALL):
        print(m)
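To see the difference between the greedy and the lazy pattern without downloading the file, here is a sketch on a hypothetical two-device snippet. The snippet is heavily simplified and assumes both devices carry actual_device_root="true".

```python
import re

# Hypothetical, heavily simplified stand-in for wurfl.xml.
sample = '''<device id="a" actual_device_root="true">
<capability name="resolution_width" value="100"/>
<capability name="resolution_height" value="200"/>
</device>
<device id="b" actual_device_root="true">
<capability name="resolution_width" value="300"/>
<capability name="resolution_height" value="400"/>
</device>'''

tail = ('actual_device_root="true">{gap}'
        '<capability name="resolution_width" value="(\\d+)"/>{gap}'
        '<capability name="resolution_height" value="(\\d+)"/>{gap}</device>')

# Same pattern twice; only the gap quantifier differs.
greedy = '<device id="(.*?)".*' + tail.format(gap='.*')
lazy = '<device id="(.*?)".*?' + tail.format(gap='.*?')

greedy_hits = re.findall(greedy, sample, re.DOTALL)
lazy_hits = re.findall(lazy, sample, re.DOTALL)
print(greedy_hits)  # one match spanning both devices
print(lazy_hits)    # one match per device
```

The greedy version pairs the first device id with the last device's resolution. Note that .*? can still run past a </device> boundary when a device lacks one of the expected pieces, which is one more reason to prefer a parser.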
I'm certainly not averse to handling xml with a regexp if the requirements are simple, but perhaps in this case using a real xml parser would be better. Using the stdlib etree module and a sprinkling of (imho) hideous xpaths:
    import xml.etree.ElementTree as ET

    def capability_value(cap_elem):
        # Return the capability's value as an int, or None if the element is missing.
        if cap_elem is None:
            return None
        return int(cap_elem.attrib.get('value'))

    def devices(wurfl_doc):
        # Relative path: ElementTree warns about paths that start with "/".
        for el in wurfl_doc.findall("./devices/device[@actual_device_root='true']"):
            width = el.find("./group[@id='display']/capability[@name='resolution_width']")
            width = capability_value(width)
            height = el.find("./group[@id='display']/capability[@name='resolution_height']")
            height = capability_value(height)
            device = {
                'id': el.attrib.get('id'),
                'resolution': {'width': width, 'height': height}
            }
            yield device

    doc = ET.ElementTree(file='wurfl.xml')
    for device in devices(doc):
        print(device)
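If the XPath expressions feel heavy, a variation is to collect each device's capabilities into a dict first. The sketch below runs on an inline sample whose layout (wurfl / devices / device / group / capability) is assumed from the paths used above rather than checked against the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the element layout is an assumption.
sample = """<wurfl><devices>
  <device id="a" actual_device_root="true">
    <group id="display">
      <capability name="resolution_width" value="100"/>
      <capability name="resolution_height" value="200"/>
    </group>
  </device>
  <device id="b">
    <group id="display">
      <capability name="resolution_width" value="300"/>
      <capability name="resolution_height" value="400"/>
    </group>
  </device>
</devices></wurfl>"""

root = ET.fromstring(sample)
found = []
for el in root.findall("./devices/device[@actual_device_root='true']"):
    # Map capability name -> value for every capability under this device.
    caps = {c.get('name'): c.get('value') for c in el.iter('capability')}
    found.append((el.get('id'),
                  int(caps['resolution_width']),
                  int(caps['resolution_height'])))
print(found)
```

Device "b" is skipped because it has no actual_device_root attribute. Unlike capability_value above, this version raises KeyError when a device lacks a resolution capability, so it only suits data where both are known to be present.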