How to parse through a script tag using Python and BeautifulSoup

2022-12-13 14:59

I am trying to extract the attributes of a frame tag that sits inside a document.write call on a page, as follows:

<script language="javascript">
.
.
.
document.write('<frame name="nav" src="/nav/index_nav.html" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" border = "no" noresize>');
 if (anchor != "") {
  document.write('<frame name="body" src="http://content.members.fidelity.com/mfl/summary/0,,' + cusip + ',00.html?' + anchor + '" marginwidth="0" marginheight="0" scrolling="auto" frameborder="0" noresize>');
 } else {
  document.write('<frame name="body" src="http://content.members.fidelity.com/mfl/summary/0,,' + cusip + ',00.html" marginwidth="0" marginheight="0" scrolling="auto" frameborder="0" noresize>');
 }
 document.write('</frameset>');


// end hiding -->
</script>

The findAll('frame') method didn't help. Is there a way to read the contents of the frame tags?

I am using Python 2.5 and BeautifulSoup 3.0.8.

I am also open to using Python 3.1 with BeautifulSoup 3.1, so long as I am able to get the results.

Thanks

You can't do it with BeautifulSoup alone. BeautifulSoup parses the HTML as it arrives at the browser (before any rewriting or DOM manipulation), and it does not parse (let alone execute) JavaScript.
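To illustrate the point without extra dependencies, here is a minimal sketch that uses the standard library's html.parser.HTMLParser in place of BeautifulSoup (a substitution made for this example only). The TagLogger class and the shortened page string are hypothetical names, not part of the original page. The sketch records which start tags the parser reports and collects the raw text found inside the script element.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: records start tags and the text inside <script>."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.script_text = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script contents arrive here as plain text, not as tags
        if self._in_script:
            self.script_text.append(data)


# Shortened stand-in for the page in the question (an assumption)
page = """<html><head>
<script language="javascript">
document.write('<frame name="nav" src="/nav/index_nav.html" noresize>');
</script>
</head></html>"""

parser = TagLogger()
parser.feed(page)
parser.close()

seen_tags = parser.tags
script_source = "".join(parser.script_text)

print(seen_tags)              # 'frame' is not among the reported tags
print(script_source.strip())  # the JavaScript source, still unparsed
```

The frame markup only shows up inside the script text, so any extraction has to work on that string rather than on the parsed tree.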

You might want to use a simple regular expression in this special case.
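A rough sketch of what such a regular expression could look like, using only the standard re module. The names WRITE_RE, ATTR_RE, JS_VAR_RE and frame_attrs are made up for this example, and the patterns assume the script keeps the exact shape shown in the question (single-quoted string pieces, double-quoted attribute values, statements ending in a semicolon). JavaScript variables spliced into an attribute are not evaluated; they are rewritten as {name} placeholders.

```python
import re

# Shortened copy of the script from the question
js_source = """
document.write('<frame name="nav" src="/nav/index_nav.html" marginwidth="0" scrolling="no" border = "no" noresize>');
if (anchor != "") {
  document.write('<frame name="body" src="http://content.members.fidelity.com/mfl/summary/0,,' + cusip + ',00.html?' + anchor + '" marginwidth="0" noresize>');
}
document.write('</frameset>');
"""

# Argument of document.write(...) when it starts with an opening <frame tag
WRITE_RE = re.compile(r"document\.write\(\s*('<frame\b.*?)\)\s*;", re.S)
# attribute="value" pairs, tolerating spaces around '='
ATTR_RE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')
# A JS variable concatenated between two string pieces: ' + name + '
JS_VAR_RE = re.compile(r"'\s*\+\s*(\w+)\s*\+\s*'")


def frame_attrs(source):
    """Return one dict of attributes per document.write('<frame ...>') call."""
    frames = []
    for match in WRITE_RE.finditer(source):
        attrs = {}
        for name, value in ATTR_RE.findall(match.group(1)):
            # Replace unevaluated JS variables with {name} placeholders
            attrs[name] = JS_VAR_RE.sub(r"{\1}", value)
        frames.append(attrs)
    return frames


frames = frame_attrs(js_source)
for attrs in frames:
    print(attrs.get("name"), attrs.get("src"))
```

Valueless attributes such as noresize are skipped, and the closing document.write('</frameset>') call is ignored because its argument does not start with an opening frame tag. If the real script varies its quoting or layout, the patterns would need adjusting.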

Pyparsing might help you bridge this mix of JS and HTML. The parser below looks for document.write statements whose argument is either a single quoted string or an expression of quoted strings and identifiers joined with +. It quasi-evaluates that expression, parses the result as a <frame> tag, and returns the frame attributes as a pyparsing ParseResults object, which lets you read the named attributes either as object attributes or as dict keys, whichever you prefer.

jssrc = """
<script language="javascript">
.
.
.
document.write('<frame name="nav" src="/nav/index_nav.html" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" border = "no" noresize>'); 
if (anchor != "") 
{  document.write('<frame name="body" src="http://content.members.fidelity.com/mfl/summary/0,,' + cusip + ',00.html?' + anchor + '" marginwidth="0" marginheight="0" scrolling="auto" frameborder="0" noresize>'); } 
else 
{  document.write('<frame name="body" src="http://content.members.fidelity.com/mfl/summary/0,,' + cusip + ',00.html" marginwidth="0" marginheight="0" scrolling="auto" frameborder="0" noresize>'); } 
document.write('</frameset>');
    // end hiding -->
    </script>"""

from pyparsing import (Suppress, QuotedString, Word, ZeroOrMore,
                       alphas, makeHTMLTags)

# define some basic punctuation, and quoted string
LPAR,RPAR,PLUS = map(Suppress,"()+")
qs = QuotedString("'")

# use pyparsing helper to define an expression for opening <frame> 
# tags, which includes support for attributes also
frameTag = makeHTMLTags("frame")[0]

# some of our document.write statements contain not a string literal,
# but an expression of strings and vars added together; define
# an identifier expression, and add a parse action that converts
# a var name to a likely value
evalvars = { 'cusip' : "CUSIP", 'anchor' : "ANCHOR" }
ident = Word(alphas).setParseAction(lambda toks: evalvars[toks[0]])

# now define the string expression itself, as a quoted string,
# optionally followed by identifiers and quoted strings added
# together; identifiers will get translated to their defined values
# as they are parsed; the first parse action on stringExpr concatenates
# all the tokens; then the second parse action actually parses the
# body of the string as a <frame> tag and returns the results of parsing
# the tag and its attributes; if the parse fails (that is, if the
# string contains something that is not a <frame> tag), the second
# parse action will throw an exception, which will cause the stringExpr
# expression to fail
stringExpr = qs + ZeroOrMore( PLUS + (ident | qs))
stringExpr.setParseAction(lambda toks : ''.join(toks))
stringExpr.addParseAction(lambda toks: 
    frameTag.parseString(toks[0],parseAll=True))

# finally, define the overall document.write(...) expression
docWrite = "document.write" + LPAR + stringExpr + RPAR

# scan through the source looking for document.write commands containing
# <frame> tags using scanString; print the original source fragment, 
# then access some of the attributes extracted from the <frame> tag
# in the quoted string, using either object-attribute notation or 
# dict index notation
for dw,locstart,locend in docWrite.scanString(jssrc):
    # single-argument print(...) calls behave the same on Python 2 and 3
    print(jssrc[locstart:locend])
    print(dw.name)
    print(dw["src"])
    print("")

Prints:

document.write('<frame name="nav" src="/nav/index_nav.html" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" border = "no" noresize>')
nav
/nav/index_nav.html

document.write('<frame name="body" src="http://content.members.fidelity.com/mfl/summary/0,,' + cusip + ',00.html?' + anchor + '" marginwidth="0" marginheight="0" scrolling="auto" frameborder="0" noresize>')
body
http://content.members.fidelity.com/mfl/summary/0,,CUSIP,00.html?ANCHOR

document.write('<frame name="body" src="http://content.members.fidelity.com/mfl/summary/0,,' + cusip + ',00.html" marginwidth="0" marginheight="0" scrolling="auto" frameborder="0" noresize>')
body
http://content.members.fidelity.com/mfl/summary/0,,CUSIP,00.html
