How can I control results returned by Python's re.findall() on an html string?
I'm trying to capture all instances of "Catalina 320" SO LONG as they occur before the "These boats" string (see generic sample below).
I have the code to capture ALL instances of "Catalina 320" but I can't figure out how to stop it at the "These boats" string.
resultsArray = re.findall(r'<tag>(Catalina 320)</tag>', string, re.DOTALL)
Can anyone help me solve this missing piece? I tried adding '.+These boats' but it didn't work.
Thanks- JD
Blah blah blah
<tag>**Catalina 320**</tag>
Blah
<td>**Catalina 320**</td>
Blah Blah
<tag>**These boats** are fully booked for the day</tag>
Blah blah blah
<tag>Catalina 320</tag>
<tag>Catalina 320</tag>
You could solve this with a regular expression, but regex isn't required given the way you stated the problem (see End Note 1).
You should use lxml to parse this:
import lxml.etree as ET
from lxml.etree import XMLParser
resultsArray = []
parser = XMLParser(ns_clean=True, recover=True)
tree = ET.parse('foo.html', parser) # See End-Note 2
for elem in tree.findall("//"):
    # elem.text is None for elements that contain no text
    text = elem.text or ""
    if "These boats" in text:
        break
    elif "Catalina 320" in text:
        resultsArray.append(ET.tostring(elem).strip())
print resultsArray
Executing this yields:
[mpenning@Bucksnort ~]$ python foo.py
['<tag>**Catalina 320**</tag>', '<td>**Catalina 320**</td>']
[mpenning@Bucksnort ~]$
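If installing lxml is not an option, the same stop-at-the-marker idea can be sketched with the standard library's html.parser (Python 3). This is a swapped-in technique, not the lxml code above: the class name `BoatCollector` and its attributes are hypothetical, the inline `html` string is a simplified stand-in for foo.html, and the sketch collects the text of matching elements rather than the serialized elements.

```python
from html.parser import HTMLParser


class BoatCollector(HTMLParser):
    """Collect text chunks containing `target` until `stop` is seen."""

    def __init__(self, target, stop):
        super().__init__()
        self.target = target
        self.stop = stop
        self.done = False   # becomes True once the stop marker is seen
        self.found = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        if self.done:
            return
        if self.stop in data:
            self.done = True
        elif self.target in data:
            self.found.append(data.strip())


# Simplified stand-in for foo.html (illustrative only).
html = """<body>
<tag>Blah blah blah</tag>
<tag>**Catalina 320**</tag>
<td>**Catalina 320**</td>
<tag>**These boats** are fully booked for the day</tag>
<tag>Catalina 320</tag>
<tag>Catalina 320</tag>
</body>"""

collector = BoatCollector("Catalina 320", "These boats")
collector.feed(html)
collector.close()
print(collector.found)
```

One limitation of this sketch: it only sees the marker when "These boats" sits inside a single run of text, so a marker split across inline tags would be missed.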
End Notes:
1. The current version of your question doesn't have valid markup, but I assumed you have either XML or HTML (which is what you had in version 1 of the question). My answer can handle your text as written, but it makes more sense to assume some kind of structured markup, so I used the following input text, which I saved locally as foo.html:
<body>
<tag>Blah blah blah</tag>
<tag>**Catalina 320**</tag>
<tag>Blah<tag>
<td>**Catalina 320**</td>
</tag>Blah Blah </tag>
<tag>**These boats** are fully booked for the day</tag>
<tag>Blah blah blah</tag>
<tag>Catalina 320</tag>
<tag>Catalina 320</tag>
</body>
2. If you want to be a bit more careful about encoding issues, you can use lxml.html.soupparser as a fallback when parsing HTML with lxml:
from lxml.html import soupparser
# ...
try:
    parser = XMLParser(ns_clean=True, recover=True)
    tree = ET.parse('foo.html', parser)
except UnicodeDecodeError:
    tree = soupparser.parse('foo.html')
If there is no other context to your problem, you can just search before the first occurrence of 'These boats':
re.findall('Catalina 320', string.split('These boats')[0])
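A minimal runnable sketch of this idea for Python 3, using the sample text from the question with the bold markers removed. The names `sample`, `head` and `matches` are illustrative and not from the original post; `str.partition` is used here in place of `split` only because it cuts once, at the first occurrence of the marker.

```python
import re

# Sample adapted from the question (bold markers removed); illustrative only.
sample = """Blah blah blah
<tag>Catalina 320</tag>
Blah
<td>Catalina 320</td>
Blah Blah
<tag>These boats are fully booked for the day</tag>
Blah blah blah
<tag>Catalina 320</tag>
<tag>Catalina 320</tag>
"""

# Everything before the first 'These boats'. If the marker is absent,
# partition returns the whole string as the head.
head, _, _ = sample.partition('These boats')

matches = re.findall(r'Catalina 320', head)
print(matches)
```

Because the search runs only on `head`, the two entries after the marker never reach the regex at all.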
groups = re.findall(r'Catalina 320(?=.*These boats)', r.read(), re.DOTALL)
groups will contain every 'Catalina 320' match that is followed, somewhere later in the text, by 'These boats'.
With a file named 'foo.html' containing:
<body>
<tag>Blah blah blah</tag>
<tag>**Catalina 320**</tag>
<tag>Blah<tag>
<td>**Catalina 320**</td>
</tag>Blah Blah </tag>
<tag>**These boats** are fully booked for the day</tag>
<tag>Blah blah blah</tag>
<tag>Catalina 320</tag>
<tag>Catalina 320</tag>
</body>
code:
# Python 2 benchmark: each approach is run n times on foo.html
from time import clock
n = 1000
########################################################################
import lxml.etree as ET
from lxml.etree import XMLParser
parser = XMLParser(ns_clean=True, recover=True)
etree = ET.parse('foo.html', parser)
te = clock()
for i in xrange(n):
    resultsArray = []
    for thing in etree.findall("//"):
        if "These boats" in thing.text:
            break
        elif "Catalina 320" in thing.text:
            resultsArray.append(ET.tostring(thing).strip())
tf = clock()
print 'Solution with lxml'
print tf-te,'\n',resultsArray
########################################################################
with open('foo.html') as f:
    text = f.read()
import re
print '\n\n----------------------------------'
rigx = re.compile(r'(Catalina 320)(?:(?:.(?!Catalina 320))*These boats.*\Z)?', re.DOTALL)
te = clock()
for i in xrange(n):
    yi = rigx.findall(text)
tf = clock()
print 'Solution 1 with a regex'
print tf-te,'\n',yi
print '\n----------------------------------'
ragx = re.compile(r'(Catalina 320)|(These boats)')
te = clock()
for i in xrange(n):
    li = []
    for mat in ragx.finditer(text):
        if mat.group(2):
            break
        else:
            li.append(mat.group(1))
tf = clock()
print 'Solution 2 with a regex, similar to solution with lxml'
print tf-te,'\n',li
print '\n----------------------------------'
regx = re.compile(r'(Catalina 320)')
te = clock()
for i in xrange(n):
    # search only the part of the text that precedes 'These boats'
    ye = regx.findall(text, 0, text.find('These boats') if 'These boats' in text else len(text))
tf = clock()
print 'Solution 3 with a regex'
print tf-te,'\n',ye
Result:
Solution with lxml
0.30324105438
['<tag>**Catalina 320**</tag>', '<td>**Catalina 320**</td>']
----------------------------------
Solution 1 with regex
0.0245033935877
['Catalina 320', 'Catalina 320']
----------------------------------
Solution 2 with a regex, similar to solution with lxml
0.0233258696287
['Catalina 320', 'Catalina 320']
----------------------------------
Solution 3 with regex
0.00784708671074
['Catalina 320', 'Catalina 320']
What is wrong with my regex solutions?
Times:
lxml - 100 %
solution 1 - 8.1 %
solution 2 - 7.7 %
solution 3 - 2.6 %
Using a regex doesn't require the text to be XML or HTML.
So what arguments remain for claiming that regexes are inferior to lxml for this problem?
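For anyone rerunning the regex part of this comparison on Python 3, where time.clock and xrange no longer exist, the three variants can be timed with timeit. This is only a sketch: the inline `text` stands in for foo.html, the function names `solution1` to `solution3` are made up here, and timings vary by machine, so none are quoted.

```python
import re
import timeit

# Inline copy of the foo.html sample, so the sketch needs no file.
text = """<body>
<tag>Blah blah blah</tag>
<tag>**Catalina 320**</tag>
<tag>Blah<tag>
<td>**Catalina 320**</td>
</tag>Blah Blah </tag>
<tag>**These boats** are fully booked for the day</tag>
<tag>Blah blah blah</tag>
<tag>Catalina 320</tag>
<tag>Catalina 320</tag>
</body>"""

rigx = re.compile(r'(Catalina 320)(?:(?:.(?!Catalina 320))*These boats.*\Z)?', re.DOTALL)
ragx = re.compile(r'(Catalina 320)|(These boats)')
regx = re.compile(r'Catalina 320')


def solution1():
    # Single pattern that swallows the rest of the text after the marker.
    return rigx.findall(text)


def solution2():
    # Walk the matches in order and stop at the marker.
    found = []
    for mat in ragx.finditer(text):
        if mat.group(2):
            break
        found.append(mat.group(1))
    return found


def solution3():
    # Limit the search window with the endpos argument of Pattern.findall.
    end = text.find('These boats')
    return regx.findall(text, 0, end if end != -1 else len(text))


for func in (solution1, solution2, solution3):
    elapsed = timeit.timeit(func, number=1000)
    print(func.__name__, elapsed, func())
```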
EDIT 1
The solution with rigx = re.compile(r'(Catalina 320)(?:(?:.(?!Catalina 320))*These boats.*\Z)?', re.DOTALL) isn't good:
this regex will catch the occurrences of 'Catalina 320' situated AFTER 'These boats' IF there are no occurrences of 'Catalina 320' BEFORE 'These boats'.
The pattern must be:
rigx = re.compile(r'(<tag>Catalina 320</tag>)(?:(?:.(?!<tag>Catalina 320</tag>))*These boats.*\Z)?|These boats.*\Z', re.DOTALL)
When the second alternative is the one that matches, findall returns an empty string for it, which has to be filtered out.
But this is a rather complicated pattern compared to the other solutions.
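The failure case described in this edit is easy to reproduce. In the sketch below (Python 3; the `text` sample is made up for illustration) there is no 'Catalina 320' before 'These boats', so the first pattern returns the occurrences that follow the marker, while the longer pattern and the finditer loop from solution 2 return nothing useful to collect.

```python
import re

# Hypothetical input: the only 'Catalina 320' entries come AFTER 'These boats'.
text = """<tag>Blah</tag>
<tag>These boats are fully booked for the day</tag>
<tag>Catalina 320</tag>
<tag>Catalina 320</tag>
"""

# First pattern: nothing stops it from matching after the marker.
rigx = re.compile(r'(Catalina 320)(?:(?:.(?!Catalina 320))*These boats.*\Z)?', re.DOTALL)
flawed = rigx.findall(text)
print(flawed)

# Longer pattern: the extra alternative consumes everything from the marker on.
# findall reports '' for that branch, so empty strings are dropped.
rigx2 = re.compile(
    r'(<tag>Catalina 320</tag>)(?:(?:.(?!<tag>Catalina 320</tag>))*These boats.*\Z)?'
    r'|These boats.*\Z',
    re.DOTALL)
fixed = [m for m in rigx2.findall(text) if m]
print(fixed)

# Solution 2 style: walk the matches in order and stop at the marker.
ragx = re.compile(r'(Catalina 320)|(These boats)')
safe = []
for mat in ragx.finditer(text):
    if mat.group(2):
        break
    safe.append(mat.group(1))
print(safe)
```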