PHP screen scraping with cURL and XPath

2023-02-19 01:01

I am trying to use XPath to scrape a site, but the initial page is a widget rather than raw HTML, so I need some way of executing the widget code to get the HTML.

The URL I want to scrape is: https://www.dealcurrent.com/customwidget.php?publisherID=36&widget=largewidget

If I echo the $html that curl_exec returns, I get the proper HTML rendered, but if I print out the $html directly it gives me something like:

<br />[ ]<br>[ try {if(window.top.location==document.URL) document.write('<meta http-equiv=refresh content="0;url=\'http://www.sweetfind.com/\'"/>'); } catch(e) {}Sweet Findif(34>=10000) window.location.href="https://www.dealcurrent.com/customwidget.php?widget=largewidget_soldout&publisherID=36"; #nav a:link { color:#666666; font-family:Arial, Helvetica, sans-serif; font-size:12px; text-decoration:none; } #nav a:visited { font-family:Arial, Helvetica, sans-serif; color:#666666; text-decoration:none; font-size:12px; text-decoration:none; } #nav a:hover { font-family:Arial, Helvetica,

etc...

Is there any way I can "execute" the code above to get the HTML output so I can use it with XPath?

cURL only gives you the HTML output; it can't execute JavaScript, since it's not a browser. Your best bet is to find another scraping tool, such as Selenium, to grab the contents of the page after the JavaScript executes. cURL probably does you no good here.
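To make that concrete, here is a minimal sketch, assuming PHP's bundled DOM extension is available. The $html string is a made-up stand-in for what curl_exec would return (the "Deals" link is invented for illustration); no network request is made. It shows that XPath sees the script element as inert text: the meta tag that the JavaScript would write is never created, while static markup is queryable as usual.

```php
<?php
// Hypothetical stand-in for the string curl_exec() would return.
$html = <<<'HTML'
<html><head>
<script>document.write('<meta http-equiv=refresh content="0;url=http://www.sweetfind.com/"/>');</script>
</head><body><div id="nav"><a href="/deals">Deals</a></div></body></html>
HTML;

// Real-world pages are rarely valid; keep parser warnings out of the output.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// The script is present as an element, but its content is never executed...
$scriptCount = $xpath->query('//script')->length;
// ...so the <meta> tag it would have written does not exist in the DOM.
$metaCount = $xpath->query('//meta')->length;
// Static markup can be queried normally.
$linkText = $xpath->query('//div[@id="nav"]/a')->item(0)->textContent;

echo "script elements: $scriptCount\n";
echo "meta elements: $metaCount\n";
echo "nav link text: $linkText\n";
```

Anything the page builds with JavaScript has to come from a tool that actually runs the script, or from emulating the script by hand.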

The short answer to your question is "No": cURL does not support JavaScript (and it probably never will, as that is not what it's built for), nor, as far as I know, does any library for PHP. See below for a list of options:

Reverse engineering the JavaScript

If you have to do this only once, then switching tools is probably not the best solution (because of codebase compatibility and all that). In this case, you could try manually emulating the effects of the JavaScript in your code: if it says window.location="example.com", you fetch 'example.com'; if it fills out and submits a form, you send a POST request. However, you will probably tire of this rather quickly. I know I did.

In this specific case, if you're trying to capture the page you're being redirected to, you could use strpos and substr to break apart the meta-redirect that is being inserted by the JavaScript, get to the URL, and simply follow that.
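A rough sketch of that string-slicing approach follows. The function name extractMetaRefreshUrl is made up for illustration, and the sample string is copied from the output shown in the question. It uses stripos (the case-insensitive sibling of strpos) together with substr, and assumes the URL ends at the next quote, backslash, space or closing bracket.

```php
<?php
/**
 * Pull the target URL out of a meta-refresh redirect using plain string
 * functions. Returns null when no redirect is found.
 * (Hypothetical helper, not part of any library.)
 */
function extractMetaRefreshUrl(string $html): ?string
{
    // Find the meta-refresh first, so an unrelated "url=" elsewhere is skipped.
    $meta = stripos($html, 'http-equiv');
    if ($meta === false) {
        return null;
    }

    $marker = 'url=';
    $start = stripos($html, $marker, $meta);
    if ($start === false) {
        return null;
    }

    // Everything after "url=", minus any leading quotes or escaping backslashes.
    $rest = ltrim(substr($html, $start + strlen($marker)), "\\'\"");

    // Assumption: the URL stops at the first quote, backslash, space or '>'.
    $length = strcspn($rest, "\\'\" >");
    $url = substr($rest, 0, $length);

    return $url === '' ? null : $url;
}

// Sample taken from the output in the question.
$sample = <<<'JS'
try {if(window.top.location==document.URL) document.write('<meta http-equiv=refresh content="0;url=\'http://www.sweetfind.com/\'"/>'); } catch(e) {}
JS;

echo extractMetaRefreshUrl($sample), "\n"; // prints http://www.sweetfind.com/
```

The returned URL can then be handed to a second cURL request. A regular expression would be more robust against variations in the markup, but this stays with the functions named above.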

Alternatives to PHP/cURL

For PHP, there currently aren't any tools (as far as I know) that allow you to execute JavaScript (or Flash), which is what you're going to run into eventually when scraping; and I've looked hard for a solution. (If you find any, please let me know.) So, when you eventually get tired of "emulating" the right scripts on a page, consider one of the tools below.

Note that what you'll mostly be using are tools for web application testing; these just lend themselves rather well to scraping.

Watir: the best tool for full JavaScript and Flash execution I have found thus far is Watir, which allows you to control an instance of any major browser from Ruby. I know that it has been ported to both Java and .NET, but I have never used either of these implementations. Note that Watir also has a very accessible implementation for XPath.
Mechanize: a web library which has implementations in several popular languages (those I know of are in Ruby, Python and, the original I believe, Perl).
Selenium: as Hisoka mentions, Selenium is also a respected tool.
HtmlUnit: another good tool (which occasionally breaks on JavaScript, and as far as I know does not implement any Flash execution) is HtmlUnit, a Java library. I've used this for a while, and it gave me an impression of bulkiness; it is a webapp-testing tool to its core. (Which is a bad thing, as you probably don't want HTML and CSS error reporting.)

(Note that this is in no way a complete list.)

Code examples

An example using Watir:

require 'watir'

browser = Watir::Browser.new
browser.goto("example.com")
# Locate the <h1> by XPath and click it
browser.h1(xpath: "//h1[@id='header']").click

I'm not sure if this is what you're looking for?

However, you have to be careful about paths defined in the code.

echo file_get_contents($url);
