Using BeautifulSoup to parse lines separated by <br> tags?

2022-12-20 16:07 Q&A author:

I have a page that looks like this:

Company A<br />
123 Main St.<br />
Suite 101<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
Company B<br />
456 Main St.<br />
Someplace, NY 1234<br />
<br />
<br />
<br />

Sometimes there are two rather than three "br" tags separating the entries. How would I use BeautifulSoup to parse through this document and extract the fields? I'm stumped because the bits of text that I need are not contained in paragraph (or similar) tags that I can simply iterate through.

You should look into the .strings attribute found on tags, then use "\n".join() on that.
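Here is a minimal sketch of that idea, assuming BeautifulSoup 4 (the bs4 package) with the built-in html.parser, and assuming the source has whitespace between consecutive <br /> tags as in the question. Rather than joining the strings, it strips each one and treats the empty ones as entry separators, which is a variation on the suggestion and not exactly what was proposed. The names sample and entries are made up for illustration.

```python
from itertools import groupby

from bs4 import BeautifulSoup

sample = """
Company A<br />
123 Main St.<br />
Suite 101<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
Company B<br />
456 Main St.<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
"""

soup = BeautifulSoup(sample, "html.parser")

# .strings yields every text node, including the bare newlines that sit
# between consecutive <br /> tags; stripping turns those into "".
lines = [s.strip() for s in soup.strings]

# Runs of non-empty lines form one entry; runs of empty lines separate entries.
entries = [list(run) for has_text, run in groupby(lines, key=bool) if has_text]

for entry in entries:
    print(entry)
```

If the <br /> tags are packed together with no whitespace between them, there is no text node to act as a separator, so this sketch would merge the entries.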

Once you have this HTML fragment, just use a regex to replace <br /> followed by an optional newline by a single newline, then split on multiple newlines. This should result in multiple individual paragraphs which you can process manually.
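A rough sketch of that regex approach, using only the standard library. The function name split_entries and the variable fragment are made up for illustration, and the pattern also accepts <br> and <br/>, which is an assumption beyond the question's markup.

```python
import re


def split_entries(fragment):
    """Split a <br />-separated fragment into a list of entries (lists of lines)."""
    # Replace each <br>, <br/> or <br /> plus an optional newline with one newline.
    text = re.sub(r"<br\s*/?>\n?", "\n", fragment, flags=re.IGNORECASE)
    # Two or more newlines in a row mark the gap between entries.
    blocks = re.split(r"\n{2,}", text.strip())
    return [block.split("\n") for block in blocks if block]


fragment = (
    "Company A<br />\n123 Main St.<br />\nSuite 101<br />\n"
    "Someplace, NY 1234<br />\n<br />\n<br />\n<br />\n"
    "Company B<br />\n456 Main St.<br />\nSomeplace, NY 1234<br />\n<br />\n<br />\n"
)

result = split_entries(fragment)
for entry in result:
    print(entry)
```

Because the split is on two or more newlines, it does not matter whether the entries are separated by two or three <br /> tags.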

You can do a little manipulation first. For example, remove all newlines, then replace every run of two or more <br /> tags with some other delimiter such as |. After that you can extract your fields.

html="""
Company A<br />
123 Main St.<br />
Suite 101<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
Company B<br />
456 Main St.<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
"""
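import re
# Join everything onto one line so consecutive <br /> tags become adjacent.
newhtml = html.replace("\n", "")
# Two or more <br /> tags in a row mark the boundary between entries.
pat = re.compile(r"(?:<br />){2,}")
print(pat.sub("|", newhtml))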

Output:

$ ./python.py
Company A<br />123 Main St.<br />Suite 101<br />Someplace, NY 1234|Company B<br />456 Main St.<br />Someplace, NY 1234|

Now the entries for each company are separated by pipes.
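From there, getting the individual fields is a matter of two splits. A short sketch, with the piped string copied from the output above and the names piped and records made up for illustration:

```python
piped = (
    "Company A<br />123 Main St.<br />Suite 101<br />Someplace, NY 1234|"
    "Company B<br />456 Main St.<br />Someplace, NY 1234|"
)

# Split on the pipe to get one string per company (dropping the empty
# trailing piece), then on the remaining single <br /> tags to get the lines.
records = [entry.split("<br />") for entry in piped.split("|") if entry]

for record in records:
    print(record)
```

This assumes that | never appears in the data itself; pick a different delimiter if it might.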

Perhaps you could use this function:

def partition_by(pred, iterable):
    current = None
    current_flag = None
    chunk = []
    for item in iterable:
        if current is None:
            current = item
            current_flag = pred(current)
            chunk = [current]
        elif pred(item) == current_flag:
            chunk.append(item)
        else:
            yield chunk
            current = item
            current_flag = not current_flag
            chunk = [current]
    if len(chunk) > 0:
        yield chunk

Add something to check for being a <br /> tag or newline:

def is_br(bs):
    try:
        return bs.name == u'br'
    except AttributeError:
        return False

def is_br_or_nl(bs):
    return is_br(bs) or u'\n' == bs

(Or whatever else is more appropriate... I'm not that good with BeautifulSoup.)

Then partition_by(is_br_or_nl, cs), with cs set to BeautifulSoup.BeautifulSoup(your_example_html).childGenerator(), yields:

[[u'Company A'],
 [<br />],
 [u'\n123 Main St.'],
 [<br />],
 [u'\nSuite 101'],
 [<br />],
 [u'\nSomeplace, NY 1234'],
 [<br />, u'\n', <br />, u'\n', <br />, u'\n', <br />],
 [u'\nCompany B'],
 [<br />],
 [u'\n456 Main St.'],
 [<br />],
 [u'\nSomeplace, NY 1234'],
 [<br />, u'\n', <br />, u'\n', <br />, u'\n', <br />]]

That should be easy enough to process.

To generalise this, you'd probably have to write a predicate to check whether its argument is something you care about... Then you could use it with partition_by to have everything else lumped together. Note that the things which you care about are lumped together as well -- you basically have to process every item of every second list produced by the resulting generator, starting with the first one which includes things you care about.
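One possible way to turn those chunks into entries is sketched below. It swaps in itertools.groupby from the standard library for the hand-written partition_by, and it uses the plain string "<br />" as a stand-in for a BeautifulSoup tag so that the example runs on its own; both choices, and the name group_entries, are assumptions for illustration. A separator run with a single <br /> is treated as a line break inside an entry, and a run with two or more as the end of the entry.

```python
from itertools import groupby

BR = "<br />"  # stand-in for a BeautifulSoup <br /> tag (assumption for this sketch)


def is_br_or_nl(item):
    return item == BR or item == "\n"


def group_entries(items):
    entries, current = [], []
    for is_separator, run in groupby(items, key=is_br_or_nl):
        run = list(run)
        if not is_separator:
            # Text we care about: strip the leading newline and keep it.
            current.extend(text.strip() for text in run)
        elif run.count(BR) >= 2 and current:
            # Two or more <br /> tags in one run close the current entry.
            entries.append(current)
            current = []
    if current:
        entries.append(current)
    return entries


gap = [BR, "\n", BR, "\n", BR, "\n", BR]
items = (
    ["Company A", BR, "\n123 Main St.", BR, "\nSuite 101", BR, "\nSomeplace, NY 1234"]
    + gap
    + ["\nCompany B", BR, "\n456 Main St.", BR, "\nSomeplace, NY 1234"]
    + gap
)

entries = group_entries(items)
for entry in entries:
    print(entry)
```

With real parser output, is_br_or_nl would be the tag-aware version defined earlier and the count would test each item with is_br instead of comparing against a string.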

I had a similar issue. This is how I solved it:

html=html.replace('<br>','<br />')
