How can I control results returned by Python's re.findall() on an html string?

2023-03-15 03:35 Q&A author:

I'm trying to capture all instances of "Catalina 320" SO LONG as they occur before the "These boats" string (see generic sample below).

I have the code to capture ALL instances of "Catalina 320" but I can't figure out how to stop it at the "These boats" string.

resultsArray = re.findall(r'<tag>(Catalina 320)</tag>', string, re.DOTALL)

Can anyone help me solve this missing piece? I tried adding '.+These boats' but it didn't work.

Thanks- JD

  Blah blah blah
    <tag>**Catalina 320**</tag>
  Blah
    <td>**Catalina 320**</td>
  Blah Blah 
    <tag>**These boats** are fully booked for the day</tag>
  Blah blah blah
    <tag>Catalina 320</tag>
    <tag>Catalina 320</tag>

You could solve this with a regular expression, but regex isn't required given the way you stated the problem (see End Note 1).

You should use lxml to parse this...

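import lxml.etree as ET
from lxml.etree import XMLParser

resultsArray = []
parser = XMLParser(ns_clean=True, recover=True)
tree = ET.parse('foo.html', parser)   # See End Note 2
for elem in tree.iter():              # every element, in document order
    text = elem.text or ''            # elem.text is None when an element has no text
    if "These boats" in text:
        break
    elif "Catalina 320" in text:
        # with_tail=False leaves out any text that follows the closing tag
        resultsArray.append(ET.tostring(elem, encoding='unicode', with_tail=False).strip())


print(resultsArray)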

Executing this yields:

[mpenning@Bucksnort ~]$ python foo.py
['<tag>**Catalina 320**</tag>', '<td>**Catalina 320**</td>']
[mpenning@Bucksnort ~]$
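
If lxml isn't installed, the same "walk the document and stop at the sentinel" idea can be sketched with the standard library's html.parser. This is only an illustration under assumptions: the class name BoatCollector and the inline sample string are made up for the example, and the markup is a simplified version of the question's text without the ** markers.

```python
from html.parser import HTMLParser


class BoatCollector(HTMLParser):
    """Record text chunks containing `target` until a chunk contains `stop`."""

    def __init__(self, target, stop):
        super().__init__()
        self.target = target
        self.stop = stop
        self.done = False
        self.current_tag = None   # most recently opened tag (not reset on close)
        self.results = []

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag

    def handle_data(self, data):
        if self.done:
            return                # ignore everything after the sentinel
        if self.stop in data:
            self.done = True
        elif self.target in data:
            self.results.append((self.current_tag, data.strip()))


sample = """
<body>
  <tag>Blah blah blah</tag>
  <tag>Catalina 320</tag>
  <tag>Blah</tag>
  <td>Catalina 320</td>
  <tag>These boats are fully booked for the day</tag>
  <tag>Catalina 320</tag>
</body>
"""

collector = BoatCollector("Catalina 320", "These boats")
collector.feed(sample)
collector.close()
print(collector.results)
```

Because html.parser is an event-based tokenizer rather than a tree builder, the tag recorded for each hit is simply the last tag that was opened, which is good enough for flat markup like this sample but not for deeply nested documents.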

End Notes:

1. The current version of your question doesn't have valid markup, but I assumed you have either XML or HTML (which is what you had in version 1 of the question). My answer can handle your text as written, but it makes more sense to assume some kind of structured markup, so I used the following input text, which I saved locally as foo.html:
```
     <body>
<tag>Blah blah blah</tag>
    <tag>**Catalina 320**</tag>
  <tag>Blah<tag>
    <td>**Catalina 320**</td>
  </tag>Blah Blah </tag>
    <tag>**These boats** are fully booked for the day</tag>
  <tag>Blah blah blah</tag>
    <tag>Catalina 320</tag>
    <tag>Catalina 320</tag>
    </body>
```
2. If you want to be a bit more careful about encoding issues, you can use lxml.html.soupparser as a fallback when parsing HTML with lxml:

from lxml.html import soupparser
# ...
try:
    parser = XMLParser(ns_clean=True, recover=True)
    tree = ET.parse('foo.html', parser)
except UnicodeDecodeError:
    tree = soupparser.parse('foo.html')

If there is no other context to your problem, you can just search before the first occurrence of 'These boats':

re.findall('Catalina 320', string.split('These boats')[0])
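
As a small self-contained sketch of that idea: str.partition returns the whole string as the head when the marker is missing, so the search still works without a special case. The helper name findall_before and the sample string are assumptions made for this example.

```python
import re

# Simplified stand-in for the question's text (no ** markers).
sample = (
    "Blah <tag>Catalina 320</tag> Blah <td>Catalina 320</td> "
    "<tag>These boats are fully booked for the day</tag> "
    "<tag>Catalina 320</tag> <tag>Catalina 320</tag>"
)


def findall_before(pattern, text, stop):
    """Run re.findall only on the part of `text` that precedes `stop`."""
    head, _sep, _tail = text.partition(stop)   # head == text if stop is absent
    return re.findall(pattern, head)


tagged = findall_before(r"<tag>(Catalina 320)</tag>", sample, "These boats")
anywhere = findall_before(r"Catalina 320", sample, "These boats")
no_marker = findall_before(r"Catalina 320", sample, "no such marker")
print(tagged, len(anywhere), len(no_marker))
```

The first call uses the question's original pattern, so it skips the occurrence wrapped in td; the second counts every occurrence before the marker.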

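groups = re.findall(r'Catalina 320(?=.*These boats)', r.read(), re.DOTALL)

groups will then hold every 'Catalina 320' match that is followed, somewhere later in the text, by 'These boats'. (A repeated capture group such as (Catalina 320)* keeps only its last repetition, so it cannot collect the matches itself.) If 'These boats' occurs more than once, this counts matches up to its last occurrence.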

With a file named 'foo.html' containing:

     <body>
<tag>Blah blah blah</tag>
    <tag>**Catalina 320**</tag>
  <tag>Blah<tag>
    <td>**Catalina 320**</td>
  </tag>Blah Blah </tag>
    <tag>**These boats** are fully booked for the day</tag>
  <tag>Blah blah blah</tag>
    <tag>Catalina 320</tag>
    <tag>Catalina 320</tag>
    </body>

code:

from time import perf_counter as clock   # time.clock() was removed in Python 3.8
n = 1000


########################################################################

import lxml.etree as ET
from lxml.etree import XMLParser

parser = XMLParser(ns_clean=True, recover=True)
etree = ET.parse('foo.html', parser)

te = clock()
for i in range(n):
    resultsArray = []
    for thing in etree.iter():        # every element, in document order
        content = thing.text or ''    # .text is None when an element has no text
        if "These boats" in content:
            break
        elif "Catalina 320" in content:
            resultsArray.append(ET.tostring(thing, encoding='unicode', with_tail=False).strip())
tf = clock()

print('Solution with lxml')
print(tf - te, resultsArray, sep='\n')


########################################################################

with open('foo.html') as f:
    text = f.read()

import re


print('\n\n----------------------------------')
# Raw strings keep \Z from being read as a string escape
rigx = re.compile(r'(Catalina 320)(?:(?:.(?!Catalina 320))*These boats.*\Z)?', re.DOTALL)

te = clock()
for i in range(n):
    yi = rigx.findall(text)
tf = clock()

print('Solution 1 with a regex')
print(tf - te, yi, sep='\n')


print('\n----------------------------------')

ragx = re.compile('(Catalina 320)|(These boats)')

te = clock()
for i in range(n):
    li = []
    for mat in ragx.finditer(text):
        if mat.group(2):              # reached 'These boats': stop collecting
            break
        else:
            li.append(mat.group(1))
tf = clock()

print('Solution 2 with a regex, similar to solution with lxml')
print(tf - te, li, sep='\n')


print('\n----------------------------------')

regx = re.compile('(Catalina 320)')

te = clock()
for i in range(n):
    # Limit the search to the text before 'These boats' via the endpos argument
    ye = regx.findall(text, 0, text.find('These boats') if 'These boats' in text else len(text))
tf = clock()

print('Solution 3 with a regex')
print(tf - te, ye, sep='\n')

result

Solution with lxml
0.30324105438 
['<tag>**Catalina 320**</tag>', '<td>**Catalina 320**</td>']


----------------------------------
Solution 1 with a regex
0.0245033935877 
['Catalina 320', 'Catalina 320']

----------------------------------
Solution 2 with a regex, similar to solution with lxml
0.0233258696287
['Catalina 320', 'Catalina 320']

----------------------------------
Solution 3 with a regex
0.00784708671074 
['Catalina 320', 'Catalina 320']
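
The figures above come from one wall-clock loop per solution. As a hedged cross-check, the standard library's timeit can repeat each measurement and keep the best run, which tends to be less sensitive to background noise. This sketch covers only the two simpler regex strategies; the function names and the inline text (used instead of foo.html) are assumptions for the example, and no timings are claimed here.

```python
import re
import timeit

# Inline stand-in for the contents of foo.html (simplified, no ** markers).
text = (
    "<body><tag>Blah</tag><tag>Catalina 320</tag><td>Catalina 320</td>"
    "<tag>These boats are fully booked for the day</tag>"
    "<tag>Catalina 320</tag><tag>Catalina 320</tag></body>"
)

TARGET = re.compile(r"Catalina 320")
EITHER = re.compile(r"(Catalina 320)|(These boats)")


def scan_with_alternation(text):
    """Iterate over matches and stop as soon as the sentinel is seen."""
    hits = []
    for m in EITHER.finditer(text):
        if m.group(2):
            break
        hits.append(m.group(1))
    return hits


def scan_with_endpos(text):
    """Locate the sentinel first, then search only the text before it."""
    stop = text.find("These boats")
    end = stop if stop != -1 else len(text)
    return TARGET.findall(text, 0, end)


for fn in (scan_with_alternation, scan_with_endpos):
    # Best of 5 repeats, each repeat timing 1000 calls
    best = min(timeit.repeat(lambda: fn(text), number=1000, repeat=5))
    print(f"{fn.__name__}: {best:.4f} s per 1000 calls")
```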

What is wrong with my regex solutions?

Times:

lxml - 100 %

solution 1 - 8.1 %

solution 2 - 7.7 %

solution 3 - 2.6 %

Using a regex doesn't require the text to be XML or HTML.

So what arguments remain for claiming that regexes are inferior to lxml for this problem?

EDIT 1

The solution with rigx = re.compile(r'(Catalina 320)(?:(?:.(?!Catalina 320))*These boats.*\Z)?', re.DOTALL) isn't good:

this regex will catch the occurrences of 'Catalina 320' situated AFTER 'These boats' IF there are no occurrences of 'Catalina 320' BEFORE 'These boats'.

The pattern must be:

rigx = re.compile(r'(<tag>Catalina 320</tag>)(?:(?:.(?!<tag>Catalina 320</tag>))*These boats.*\Z)?|These boats.*\Z', re.DOTALL)

But this is a rather complicated pattern compared to the other solutions.
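
The difference between the two patterns can be seen in a short sketch. For brevity it uses the bare 'Catalina 320' text rather than the <tag>-wrapped form, and the two sample strings and the helper hits are assumptions made for the example. One detail to be aware of: when the added 'These boats' branch is the one that matches, findall reports an empty string for the capture group, so the sketch filters those out.

```python
import re

# Original pattern (Solution 1) and the corrected one with the extra branch.
first = re.compile(
    r"(Catalina 320)(?:(?:.(?!Catalina 320))*These boats.*\Z)?", re.DOTALL
)
fixed = re.compile(
    r"(Catalina 320)(?:(?:.(?!Catalina 320))*These boats.*\Z)?|These boats.*\Z",
    re.DOTALL,
)

none_before = "Blah These boats are fully booked Catalina 320 Blah Catalina 320"
some_before = "Catalina 320 Blah Catalina 320 These boats are booked Catalina 320"


def hits(regex, text):
    """Return captured matches, dropping the '' produced by the sentinel branch."""
    return [g for g in regex.findall(text) if g]


print(hits(first, none_before))   # the flaw: matches found AFTER the sentinel
print(hits(fixed, none_before))   # corrected pattern finds nothing here
print(hits(fixed, some_before))   # only the matches before the sentinel
```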
