Converting pyparsing.ParseResults back to html string

2023-02-16 04:18 Q&A author:

How can I convert an instance of the class pyparsing.ParseResults back to an HTML string?

For example:

>>> type(gcdata)
<type 'unicode'>
>>> pat
{<"div"> SkipTo:(</"div">) </"div">}
>>> type(pat)
<class 'pyparsing.And'>
>>> 
>>> l = pat.searchString( gcdata  )
>>> l[0]
(['div', ([u'class', u'shoveler'], {}), ([u'id', u'purchaseShvl'], {}), False, u'<div class="shoveler-heading">\n    <p>Customers Who Bought This Item Also Bought</p>\n    \n', '</div>'], {'startDiv': [((['div', ([u'class', u'shoveler'], {}), ([u'id', u'purchaseShvl'], {}), False], {u'class': [(u'shoveler', 1)], 'empty': [(False, 3)], u'id': [(u'purchaseShvl', 2)]}), 0)], 'endDiv': [('</div>', 5)], u'class': [(u'shoveler', 1)], 'empty': [(False, 3)], u'id': [(u'purchaseShvl', 2)]})
>>> 
>>> type(l[0])
<class 'pyparsing.ParseResults'>
>>> 
>>> divhtml = foo (l[0])

So I need this function foo.

Any suggestions?

This is an issue with the expressions returned by makeHTMLTags: they do a lot of extra grouping and naming, which gets in your way if you just want the tag text.

Pyparsing includes the method originalTextFor to help address this. Building on the sample code from @samplebias:

start, end = makeHTMLTags('div')
#anchor = start + SkipTo(end).setResultsName('body') + end 
anchor = originalTextFor(start + SkipTo(end).setResultsName('body') + end)

By wrapping the expression in originalTextFor, all of the breakup of the tag into its component parts gets undone, and you just get back the text from the original string (also including any intervening whitespace). The default behavior is to just give you back this string, which has the unfortunate side effect of losing all of the results names, so getting back the parsed attribute values can be a hassle. When I wrote originalTextFor, I assumed that a string was what was wanted, and I could not attach results names to a string. So I added an optional parameter asString to originalTextFor which defaults to True, but if passed as False, will return a ParseResults containing just a single token of the entire matched string, plus all matched results names. So you could still extract res.id from the results, while res[0] would return you the entire matched HTML.
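As a minimal sketch of the asString=False behavior described above (the sample HTML string and the variable names are made up for illustration, not taken from the question):

```python
from pyparsing import makeHTMLTags, SkipTo, originalTextFor

# Hypothetical sample input; the real page source was not posted.
html = '<body><div class="shoveler" id="purchaseShvl"><p>Also bought</p></div></body>'

start, end = makeHTMLTags("div")

# asString=False keeps the results names and collapses the tokens
# into a single string holding the original matched text.
div_expr = originalTextFor(start + SkipTo(end)("body") + end, asString=False)

match = div_expr.searchString(html)[0]
div_html = match[0]   # the entire matched HTML
div_id = match.id     # attribute value, still reachable by name

print(div_html)
print(div_id)
```

With the default asString=True, match[0] would be the same text, but match.id would no longer be available.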

Some other comments:

<div> is a very common tag, and one easily matched in error using just the tag returned by makeHTMLTags. It will match any div, and probably many you aren't really interested in. You can cut down the number of mismatches if you can specify some attribute that should also match, using withAttribute. You could do this with:

start.setParseAction(withAttribute(id="purchaseShvl"))

or

start.setParseAction(withAttribute(**{"class":"shoveler"}))

(Using 'class' as a filtering attribute is probably the most common thing you'll want to do, but since 'class' is also a Python keyword, you can't just use the named-arguments form as I did with id, too bad.)
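A small sketch of that filtering in action, assuming a made-up input with two divs of different classes; only the div whose class attribute matches survives the parse action:

```python
from pyparsing import makeHTMLTags, withAttribute

# Hypothetical input: one div we do not want, one we do.
html = (
    '<div class="other">skip me</div>'
    '<div class="shoveler" id="purchaseShvl">keep me</div>'
)

start, end = makeHTMLTags("div")

# 'class' is a Python keyword, so pass it through a dict instead of
# writing withAttribute(class="shoveler"), which is a syntax error.
start.setParseAction(withAttribute(**{"class": "shoveler"}))

matches = start.searchString(html)
matched_ids = [m.id for m in matches]
print(matched_ids)
```

Note that calling setParseAction a second time replaces the earlier parse action, so to require both id and class, pass both to a single withAttribute call.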

Lastly, along with the commonness of <div> is the likelihood of nesting. divs are frequently nested within divs, and just plain SkipTo is not smart enough to take this into account. We see this when reconstructing your posted results:

<div class='shoveler' id='purchaseShvl'>
<div class='shoveler-heading'>
<p>Customers Who Bought This Item Also Bought</p>
</div>

The first terminating </div> ends the match for your expression. I suspect that you may need to expand your matching expression to take these additional divs into account, instead of just plain SkipTo(end).
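One possible way to account for nesting is a recursive expression built with Forward. This is only a sketch under the assumption that the divs are well formed and properly closed; the sample HTML and the names text, div and nested_div are illustrative, not part of the original code:

```python
from pyparsing import Forward, SkipTo, ZeroOrMore, makeHTMLTags, originalTextFor

# Hypothetical sample input with one div nested inside another.
html = (
    '<div id="outer"><p>a</p>'
    '<div class="inner"><span>b</span></div>'
    '<p>c</p></div>'
    '<div id="next">z</div>'
)

start, end = makeHTMLTags("div")

# Text containing no div tags: everything up to the next <div ...> or </div>.
text = SkipTo(start | end)

# A div is an opening tag, then text interleaved with complete nested divs,
# then its own closing tag.
div = Forward()
div <<= start + text + ZeroOrMore(div + text) + end

# Return each outermost div as the original text it was matched from.
nested_div = originalTextFor(div)
found = [m[0] for m in nested_div.searchString(html)]

for item in found:
    print(item)
```

Here the outer div is returned whole, including the inner div, rather than being cut off at the first </div>. Malformed HTML with unclosed divs would still defeat this approach.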

You would be much better off using an HTML parser which returns a DOM, like lxml.html, but I suspect you're doing this more to learn Pyparsing. Since you didn't post a snippet of source code, I've taken a few guesses and made an example using pyparsing.makeHTMLTags, listed below.

# Python 3 version: cgi.escape was removed in Python 3.8, so html.escape is used.
from html import escape
from pyparsing import makeHTMLTags, SkipTo

raw = """<body><div class="shoveler" id="purchaseShvl">
<p>Customers who bought this item also bought</p>
<div class="foo">
    <span class="bar">Shovel cozy</span>
    <span class="bar">Shovel rack</span>
</div>
</div></body>"""

def foo(parseResult):
    # Rebuild each matched div from its named results.
    parts = []
    for token in parseResult:
        # 'class' is a Python keyword, so it is read with getattr.
        st = '<div id="%s" class="%s">' % \
             (escape(getattr(token, 'id')),
              escape(getattr(token, 'class')))
        # body is the skipped text; endDiv is the closing tag text.
        parts.append(st + token.body + token.endDiv)
    return '\n'.join(parts)

start, end = makeHTMLTags('div')
anchor = start + SkipTo(end).setResultsName('body') + end
res = anchor.searchString(raw)
print(foo(res))
