Using BeautifulSoup to parse lines separated by <br> tags?
I have a page that looks like this:
Company A<br />
123 Main St.<br />
Suite 101<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
Company B<br />
456 Main St.<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
Sometimes there are two rather than three "br" tags separating the entries. How would I use BeautifulSoup to parse through this document and extract the fields? I'm stumped because the bits of text that I need are not contained in paragraph (or similar) tags that I can simply iterate through.
You should look into the .strings attribute found on tags, then use "\n".join() on that.
Once you have this HTML fragment, just use a regex to replace each <br /> followed by an optional newline with a single newline, then split on runs of multiple newlines. This should give you one block of text per entry, which you can then process manually.
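A minimal sketch of that regex approach, using only the standard re module. The sample string html and the helper name split_records are illustrative assumptions, not part of the original suggestion; the pattern is also written to tolerate the <br> and <br/> spellings.

```python
import re

# Sample input (assumed); the first entry ends with two blank <br /> lines,
# the second with three, to mimic the inconsistent separators.
html = """Company A<br />
123 Main St.<br />
Suite 101<br />
Someplace, NY 1234<br />
<br />
<br />
Company B<br />
456 Main St.<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
"""

# Matches <br>, <br/> or <br />, plus one optional line ending right after it.
BR = re.compile(r"<br\s*/?>\r?\n?", re.IGNORECASE)


def split_records(fragment):
    """Return one list of fields per entry (hypothetical helper)."""
    # Step 1: every <br /> (and its trailing newline) becomes a single newline.
    text = BR.sub("\n", fragment)
    records = []
    # Step 2: a run of two or more newlines separates entries.
    for block in re.split(r"\n{2,}", text):
        fields = [line.strip() for line in block.split("\n") if line.strip()]
        if fields:
            records.append(fields)
    return records


for record in split_records(html):
    print(record)
```

Because the split is on "two or more" newlines, it does not matter whether an entry is followed by two or three empty <br /> lines.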
You can do a little bit of manipulation before anything else. For example, remove all newlines, then replace runs of two or more <br /> tags with some other delimiter such as |. After that you can extract your fields.
html="""
Company A<br />
123 Main St.<br />
Suite 101<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
Company B<br />
456 Main St.<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
"""
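import re
# Drop newlines so that consecutive <br /> tags become adjacent.
newhtml = html.replace("\n", "")
# A run of two or more <br /> tags marks the end of an entry.
pat = re.compile(r"(<br />){2,}")
print(pat.sub("|", newhtml))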
output
$ ./python.py
Company A<br />123 Main St.<br />Suite 101<br />Someplace, NY 1234|Company B<br />456 Main St.<br />Someplace, NY 1234|
Now the entries for each company are separated by pipes.
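To get from the pipe-delimited string to the individual fields, one possible follow-up step is sketched here. The variable names joined and companies are assumptions for illustration, and the sketch assumes that no field contains a literal | character.

```python
import re

# The pipe-delimited string produced by the substitution step above.
joined = (
    "Company A<br />123 Main St.<br />Suite 101<br />Someplace, NY 1234|"
    "Company B<br />456 Main St.<br />Someplace, NY 1234|"
)

# Split into entries on "|", then split each entry into fields on the
# remaining single <br /> tags; the empty piece after the last "|" is skipped.
companies = [
    [field.strip() for field in re.split(r"<br\s*/?>", record)]
    for record in joined.split("|")
    if record.strip()
]

for fields in companies:
    print(fields)
```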
Perhaps you could use this function:
def partition_by(pred, iterable):
    current = None
    current_flag = None
    chunk = []
    for item in iterable:
        if current is None:
            current = item
            current_flag = pred(current)
            chunk = [current]
        elif pred(item) == current_flag:
            chunk.append(item)
        else:
            yield chunk
            current = item
            current_flag = not current_flag
            chunk = [current]
    if len(chunk) > 0:
        yield chunk
Add something to check for being a <br /> tag or newline:
def is_br(bs):
    try:
        return bs.name == u'br'
    except AttributeError:
        return False

def is_br_or_nl(bs):
    return is_br(bs) or u'\n' == bs
(Or whatever else is more appropriate... I'm not that good with BeautifulSoup.)
Then use partition_by(is_br_or_nl, cs) to yield (for cs set to BeautifulSoup.BeautifulSoup(your_example_html).childGenerator()):
[[u'Company A'],
[<br />],
[u'\n123 Main St.'],
[<br />],
[u'\nSuite 101'],
[<br />],
[u'\nSomeplace, NY 1234'],
[<br />, u'\n', <br />, u'\n', <br />, u'\n', <br />],
[u'\nCompany B'],
[<br />],
[u'\n456 Main St.'],
[<br />],
[u'\nSomeplace, NY 1234'],
[<br />, u'\n', <br />, u'\n', <br />, u'\n', <br />]]
That should be easy enough to process.
To generalise this, you'd probably have to write a predicate to check whether its argument is something you care about... Then you could use it with partition_by to have everything else lumped together. Note that the things you care about are lumped together as well, so you basically have to process every item of every second list produced by the resulting generator, starting with the first list that contains things you care about.
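As a rough sketch of that post-processing, the standard library's itertools.groupby can stand in for partition_by, since it hands back the predicate's result together with each chunk and so avoids counting "every second list". Everything here is an assumption for illustration: plain strings replace the parsed BeautifulSoup nodes, "<br />" marks a break tag, and is_separator and group_records are hypothetical names.

```python
from itertools import groupby

# Plain strings stand in for parsed nodes; "<br />" represents a break tag.
nodes = [
    "Company A", "<br />", "\n123 Main St.", "<br />",
    "\nSomeplace, NY 1234", "<br />", "\n", "<br />", "\n", "<br />",
    "\nCompany B", "<br />", "\n456 Main St.", "<br />", "\n", "<br />",
]


def is_separator(node):
    """True for a break marker or a whitespace-only text node."""
    return node == "<br />" or node.strip() == ""


def group_records(items):
    """Yield one list of fields per entry."""
    current = []
    for is_sep, chunk in groupby(items, key=is_separator):
        chunk = list(chunk)
        if not is_sep:
            # Text we care about: collect it as fields of the current entry.
            current.extend(text.strip() for text in chunk)
        elif chunk.count("<br />") >= 2 and current:
            # Two or more breaks in one separator run end the entry.
            yield current
            current = []
    if current:
        yield current


for record in group_records(nodes):
    print(record)
```

A single break between two text nodes only separates fields, while a run with two or more breaks closes the entry, which covers both the two-tag and three-tag separators.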
I had a similar issue. This is how I solved it:
html=html.replace('<br>','<br />')