Extracting data from JavaScript (Python Scraper)

2023-02-07 18:32

I'm currently using a combination of urllib2, pyquery, and json to scrape a site, and now I find that I need to extract some data from JavaScript. One thought would be to use a JavaScript engine (like V8), but that seems like overkill for what I need. I would use regular expressions, but the expression for this seems way too complex.


JavaScript:

(function(){DOM.appendContent(this, HTML("<html>"));;})

I need to extract the <html>, but I'm not entirely sure how to do so. The <html> itself can contain basically every character under the sun, so [^"] won't work.

Any thoughts?

Why regex? Since you know how many characters to trim off the beginning and the end, can't you just slice the string?

string[42:-7]

As well as being quicker than a regex, it then doesn't matter if quotes inside <html> are escaped or not.
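A minimal sketch of that slicing idea, assuming the wrapper around the payload is always exactly the one shown in the question. The names PREFIX, SUFFIX and extract_by_slicing are made up for illustration, and the sample payload is a stand-in. Deriving the offsets from the wrapper text (42 and 7 here) and checking it first means a changed wrapper fails loudly instead of returning a silently wrong slice.

```python
# Hypothetical constants: the fixed wrapper text copied from the question.
PREFIX = '(function(){DOM.appendContent(this, HTML("'
SUFFIX = '"));;})'


def extract_by_slicing(source):
    """Return whatever sits between the known prefix and suffix."""
    # Guard against the site changing its wrapper.
    if not (source.startswith(PREFIX) and source.endswith(SUFFIX)):
        raise ValueError("JavaScript does not have the expected wrapper")
    return source[len(PREFIX):-len(SUFFIX)]


# Stand-in payload containing escaped quotes, as a JavaScript string would.
js = '(function(){DOM.appendContent(this, HTML("<p class=\\"x\\">hi</p>"));;})'
payload = extract_by_slicing(js)
```

Note that payload still contains the JavaScript escapes (\" and so on) exactly as they appear in the source.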

If every occurrence of " inside the HTML code is escaped as \" (it is a JavaScript string after all), you could use

HTML\("((?:\\"|.)*?)"\)

to get the parameter to HTML into the first capturing group.

Note that this regex is not yet escaped for use inside a JavaScript string literal itself.
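Since the question is about a Python scraper, here is a rough sketch of how that pattern could be used with Python's re module. A raw string avoids any extra escaping of the pattern. The names HTML_ARG and extract_html_argument are invented for the example, re.DOTALL is my own assumption (in case the payload spans several lines), and the final replace only undoes escaped quotes, not other JavaScript escapes.

```python
import re

# The pattern from the answer, written as a raw string.
# re.DOTALL is an assumption so that "." also matches newlines.
HTML_ARG = re.compile(r'HTML\("((?:\\"|.)*?)"\)', re.DOTALL)


def extract_html_argument(source):
    """Return the raw argument of HTML("..."), or None if not found."""
    match = HTML_ARG.search(source)
    return match.group(1) if match else None


# Stand-in input with escaped quotes inside the JavaScript string.
js = '(function(){DOM.appendContent(this, HTML("<a href=\\"/x\\">say \\"hi\\"</a>"));;})'
raw = extract_html_argument(js)

# Undo only the \" escapes; other escapes (\n, \u1234, ...) are left as-is.
html = raw.replace('\\"', '"') if raw is not None else None
```

The resulting html string could then be passed on to pyquery like any other markup.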
