Converting pyparsing.ParseResults back to html string
I'm brand new to pyparsing.
How can I convert an instance of class pyparsing.ParseResults back to an HTML string? For example:
>>> type(gcdata)
<type 'unicode'>
>>> pat
{<"div"> SkipTo:(</"div">) </"div">}
>>> type(pat)
<class 'pyparsing.And'>
>>>
>>> l = pat.searchString( gcdata )
>>> l[0]
(['div', ([u'class', u'shoveler'], {}), ([u'id', u'purchaseShvl'], {}), False, u'<div class="shoveler-heading">\n <p>Customers Who Bought This Item Also Bought</p>\n \n', '</div>'], {'startDiv': [((['div', ([u'class', u'shoveler'], {}), ([u'id', u'purchaseShvl'], {}), False], {u'class': [(u'shoveler', 1)], 'empty': [(False, 3)], u'id': [(u'purchaseShvl', 2)]}), 0)], 'endDiv': [('</div>', 5)], u'class': [(u'shoveler', 1)], 'empty': [(False, 3)], u'id': [(u'purchaseShvl', 2)]})
>>>
>>> type(l[0])
<class 'pyparsing.ParseResults'>
>>>
>>> divhtml = foo (l[0])
So, I need this function foo.
Any suggestions?

This is an issue with the expressions returned by makeHTMLTags: a lot of extra grouping and naming goes on, which gets in your way if you just want the tag text.
Pyparsing includes the helper function originalTextFor to help address this. Building on the sample code from @samplebias:
start, end = makeHTMLTags('div')
#anchor = start + SkipTo(end).setResultsName('body') + end
anchor = originalTextFor(start + SkipTo(end).setResultsName('body') + end)
By wrapping the expression in originalTextFor, all of the breakup of the tag into its component parts gets undone, and you just get back the text from the original string (also including any intervening whitespace). The default behavior is to give you back just this string, which has the unfortunate side effect of losing all of the results names, so getting back the parsed attribute values can be a hassle. When I wrote originalTextFor, I assumed that a string was what was wanted, and I could not attach results names to a string. So I added an optional parameter asString to originalTextFor, which defaults to True; if passed as False, it returns a ParseResults containing just a single token of the entire matched string, plus all matched results names. So you could still extract res.id from the results, while res[0] would return the entire matched HTML.
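Here is a minimal sketch of the asString=False form. The html string and the names div_expr, div_html, and div_id are made up for illustration, and the input is deliberately simplified so that the div is not nested:

```python
from pyparsing import makeHTMLTags, SkipTo, originalTextFor

# Hypothetical input, simplified so that the div contains no nested div.
html = '<p>intro</p><div class="shoveler" id="purchaseShvl"><p>Also bought</p></div>'

start, end = makeHTMLTags("div")
# asString=False keeps the results names alongside the original matched text.
div_expr = originalTextFor(
    start + SkipTo(end).setResultsName("body") + end, asString=False
)

res = div_expr.searchString(html)[0]
div_html = res[0]  # the entire matched HTML, as a single string
div_id = res.id    # a named attribute parsed from the opening tag
print(div_html)
print(div_id)
```

With the default asString=True you would get only the string, so res.id would no longer be available.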
Some other comments:
<div> is a very common tag, and one easily matched in error using just the tag returned by makeHTMLTags. It will match any div, and probably many you aren't really interested in. You can cut down the number of mismatches if you can specify some attribute that should also match, using withAttribute. You could do this with:
start.setParseAction(withAttribute(id="purchaseShvl"))
or
start.setParseAction(withAttribute(**{"class":"shoveler"}))
(Using 'class' as a filtering attribute is probably the most common thing you'll want to do, but since 'class' is also a Python keyword, you can't use the named arguments form as I did with id, too bad.)
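As a rough illustration of the filtering, here is a small sketch. The two-div html string and the names expr and bodies are invented for the example; only the div whose class is "shoveler" should survive the parse action:

```python
from pyparsing import makeHTMLTags, SkipTo, withAttribute

# Hypothetical input: one div we do not care about, one that we do.
html = (
    '<div class="nav">menu</div>'
    '<div class="shoveler" id="purchaseShvl">Also bought</div>'
)

start, end = makeHTMLTags("div")
# Reject any opening <div> tag whose class attribute is not "shoveler".
start.setParseAction(withAttribute(**{"class": "shoveler"}))

expr = start + SkipTo(end).setResultsName("body") + end
bodies = [match.body for match in expr.searchString(html)]
print(bodies)
```

The parse action raises a ParseException for non-matching tags, so searchString simply moves on to the next candidate.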
Lastly, along with the commonness of <div> is the likelihood of nesting. divs are frequently nested within divs, and just plain SkipTo is not smart enough to take this into account. We see this when reconstructing your posted results:
<div class='shoveler' id='purchaseShvl'>
<div class='shoveler-heading'>
<p>Customers Who Bought This Item Also Bought</p>
</div>
The first terminating </div> ends the match for your expression. I suspect that you may need to expand your matching expression to take into account these additional divs, instead of just plain SkipTo(end).
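One possible way to expand the expression, sketched here as an assumption rather than a recommended recipe, is a recursive Forward that lets a div contain other divs. The names nested_div, text, and full_html are made up, and the sample input assumes well-formed, balanced tags:

```python
from pyparsing import Forward, ZeroOrMore, SkipTo, makeHTMLTags, originalTextFor

raw = """<body><div class="shoveler" id="purchaseShvl">
<p>Customers who bought this item also bought</p>
<div class="foo">
<span class="bar">Shovel cozy</span>
</div>
</div></body>"""

any_start, any_end = makeHTMLTags("div")

# Text up to the next opening or closing div tag (may be empty).
text = SkipTo(any_start | any_end)

# A div is: start tag, text, any number of (nested div, text), end tag.
nested_div = Forward()
nested_div <<= any_start + text + ZeroOrMore(nested_div + text) + any_end

# originalTextFor returns the matched source text as one string.
full_html = originalTextFor(nested_div).searchString(raw)[0][0]
print(full_html)
```

Because the inner div is consumed by the recursive part, the match now runs to the closing tag of the outer div instead of stopping at the first </div>.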

You would be much better off using an HTML parser which returns a DOM, like lxml.html, but I suspect you're doing this more to learn pyparsing. Since you didn't post a snippet of source code, I've taken a few guesses and made an example using pyparsing.makeHTMLTags, listed below.
try:
    from html import escape  # Python 3
except ImportError:
    from cgi import escape  # Python 2

from pyparsing import makeHTMLTags, SkipTo

raw = """<body><div class="shoveler" id="purchaseShvl">
<p>Customers who bought this item also bought</p>
<div class="foo">
<span class="bar">Shovel cozy</span>
<span class="bar">Shovel rack</span>
</div>
</div></body>"""

def foo(parseResult):
    # Rebuild each matched div from its parsed attributes, body, and end tag.
    parts = []
    for token in parseResult:
        # 'class' is a Python keyword, so the attributes are read with getattr().
        st = '<div id="%s" class="%s">' % \
            (escape(getattr(token, 'id')),
             escape(getattr(token, 'class')))
        parts.append(st + token.body + token.endDiv)
    return '\n'.join(parts)

start, end = makeHTMLTags('div')
anchor = start + SkipTo(end).setResultsName('body') + end
res = anchor.searchString(raw)
print(foo(res))