Input a URL, output the contents of "view page source" after JavaScript etc. has run: library or command line?
I need a scalable, automated method of dumping the contents of "view page source", after manipulation, to a file. This non-interactive method would be (more or less) identical to an army of humans navigating my list of URLs and dumping "view page source" to a file. Programs such as wget or curl will non-interactively retrieve a set of URLs, but do not execute JavaScript or any of that 'fancy stuff'.
My ideal solution looks like any of the following (fantasy solutions):
cat urls.txt | google-chrome --quiet --no-gui \
--output-sources-directory=~/urls-source
(fantasy command line, no idea if flags like these exist)
or
cat urls.txt | python -c "import some-library; \
... use some-library to process urls.txt ; output sources to ~/urls-source"
As a secondary concern, I also need:
- dump all included javascript source to file (a la firebug)
- dump pdf/image of page to file (print to file)
HtmlUnit does execute JavaScript. I'm not sure whether you can obtain the HTML after DOM manipulation, but give it a try.
You could write a small Java program that fits your requirements and run it from the command line, as in your examples.
I haven't tried the code below; I only had a look at the JavaDoc:
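import com.gargoylesoftware.htmlunit.WebClient;      // HtmlUnit 2.x package names
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class DumpSource {
    public static void main(String[] args) throws Exception {
        String pageURL = args[0];  // first command-line argument (args[1] would be the second)
        WebClient webClient = new WebClient();
        HtmlPage page = webClient.getPage(pageURL);
        // asXml() serializes the current DOM; asText() would return only the visible text
        String pageContents = page.asXml();
        // Save the resulting page to a file
    }
}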
EDIT:
It seems Selenium (another web testing framework) can take page screenshots.
Search for selenium.captureScreenshot.
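For what it's worth, the Selenium route could look roughly like the sketch below. It uses the WebDriver Python bindings (driver.page_source and driver.save_screenshot) rather than the older selenium.captureScreenshot call, and it assumes Selenium plus a Chrome driver are installed. The names file_stem, render_with_selenium and dump_all are made up for this example, and I have not run it against real sites.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse


def file_stem(url):
    """Build a filesystem-safe, collision-resistant name for a URL."""
    host = urlparse(url).netloc.replace(":", "_") or "page"
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:10]
    return f"{host}-{digest}"


def render_with_selenium(url, screenshot_path):
    """Load url in headless Chrome and return the DOM as the browser sees it."""
    # Imported lazily so the helpers above work without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)  # returns once the page has loaded
        driver.save_screenshot(str(screenshot_path))  # PNG of the viewport
        return driver.page_source
    finally:
        driver.quit()


def dump_all(urls, out_dir, render=render_with_selenium):
    """Write <stem>.html for each URL; render() is also handed a .png path."""
    out = Path(out_dir).expanduser()
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for url in urls:
        url = url.strip()
        if not url:
            continue  # skip blank lines
        stem = file_stem(url)
        html = render(url, out / f"{stem}.png")
        target = out / f"{stem}.html"
        target.write_text(html, encoding="utf-8")
        written.append(target)
    return written
```

Something like dump_all(open("urls.txt"), "~/urls-source") would then process the list from the question. Starting one browser per URL is slow, so a real run would probably reuse the driver, and pages that keep changing the DOM after load may need an explicit wait before page_source is read.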
You can use the IRobotSoft web scraper to automate this. The page source is held in the UpdatedPage variable; you only need to save that variable to a file.
It also has a CapturePage() function that captures the web page to an image file.