Using Scrapy with JavaScript and iframes, and alternatives [closed]
I'm trying to use Scrapy to scrape the U.S. government regulations website (www.regulations.gov). It has a ton of information on it, but it's a terrible website, chock-full of JavaScript and iframes. I tried running some simple Scrapy spiders, but I can't parse anything out because everything loads through JavaScript and iframes.
For instance, on the main search page, this block of code actually loads the results table:
<script type="text/javascript" src="Regs/Regs.nocache.js?REGS211-b3"></script>
<title>Regulations.gov</title>
<link rel="stylesheet" type="text/css" href="css/print.css" media="print" />
</head>
<body class="bodyLoading">
<!-- this is required for GWT history support -->
<iframe src="javascript:''" id="__gwt_historyFrame" tabIndex='-1' style="position:absolute;width:0;height:0;border:0"></iframe>
<!-- For printing window contents -->
<iframe id="__printingFrame" style="width:0;height:0;border:0;" ></iframe>
Individual results pages have the same problem. For instance, this page has the same source as above.
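To make the problem concrete, here is a minimal sketch (standard library only) that parses the markup quoted above the way a static scraper would see it. The names `TagCounter` and `inspect_static_html` are made up for illustration, and the HTML string is just the fragment from this post, not the full page. It suggests the response carries only a script reference and two empty iframes, with no table for a spider to select.

```python
from collections import Counter
from html.parser import HTMLParser

# The fragment quoted in the question, as a plain HTTP client receives it.
SHELL_HTML = """\
<script type="text/javascript" src="Regs/Regs.nocache.js?REGS211-b3"></script>
<title>Regulations.gov</title>
<link rel="stylesheet" type="text/css" href="css/print.css" media="print" />
</head>
<body class="bodyLoading">
<!-- this is required for GWT history support -->
<iframe src="javascript:''" id="__gwt_historyFrame" tabIndex='-1' style="position:absolute;width:0;height:0;border:0"></iframe>
<!-- For printing window contents -->
<iframe id="__printingFrame" style="width:0;height:0;border:0;" ></iframe>
"""


class TagCounter(HTMLParser):
    """Counts start tags and records the src of every external script."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.script_srcs = []

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.script_srcs.append(src)


def inspect_static_html(html):
    """Return (tag counts, external script URLs) for un-rendered HTML."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.script_srcs


tags, script_srcs = inspect_static_html(SHELL_HTML)
print("tables in static HTML:", tags["table"])
print("iframes in static HTML:", tags["iframe"])
print("scripts that build the page:", script_srcs)
```

Any selector aimed at the results table would come back empty on this input, because the table only exists after the script has run in a browser.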
Can Scrapy handle this problem at all? Are there any alternatives that might be able to?
Alternatives: try
1) Selenium
2) iMacros
3) PhantomJS with CasperJS
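As a rough illustration of the first option, here is a sketch of how a real browser driven by Selenium could render the page before parsing. It is not a tested recipe for regulations.gov. The helper names (`fetch_rendered_html`, `has_table`, `scrape_with_firefox`) are hypothetical, and the readiness check assumes the rendered results show up inside a `<table>`, which may not match the real page. It relies only on the WebDriver calls `get`, `page_source` and `quit`, and needs the third-party `selenium` package plus a local Firefox install.

```python
import time


def has_table(html):
    # Assumption: rendered results appear in a <table>; adjust to the real page.
    return "<table" in html.lower()


def fetch_rendered_html(driver, url, is_ready, timeout=15.0, poll=0.5):
    """Load url in a browser and return the DOM once is_ready(html) is true.

    driver is any object with get(url) and a page_source attribute,
    such as a Selenium WebDriver.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source  # the DOM after JavaScript has run so far
        if is_ready(html):
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{url} did not finish rendering in {timeout}s")
        time.sleep(poll)


def scrape_with_firefox(url):
    """Render url in Firefox and return the resulting HTML."""
    # Imported lazily so the helpers above work without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        return fetch_rendered_html(driver, url, has_table)
    finally:
        driver.quit()  # always close the browser, even on timeout
```

Calling `scrape_with_firefox("http://www.regulations.gov/")` would return the rendered HTML as a string, which can then be parsed with the usual tools, for example `scrapy.Selector(text=html)`. Driving a full browser is much slower than a plain Scrapy request, so this fits best when the number of pages is modest.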