
input URL, output contents of "view page source", i.e. after javascript / etc, library or command-line

I need a scalable, automated method of dumping the contents of "view page source", after manipulation, to a file. This non-interactive method would be (more or less) identical to an army of humans navigating my list of URLs and dumping "view page source" to a file. Programs such as wget or curl will non-interactively retrieve a set of URLs, but do not execute javascript or any of that 'fancy stuff'.

My ideal solution looks like any of the following (fantasy solutions):

cat urls.txt | google-chrome --quiet --no-gui \
--output-sources-directory=~/urls-source  
(fantasy command line, no idea if flags like these exist)

or

cat urls.txt | python -c "import some-library; \
... use some-library to process urls.txt ; output sources to ~/urls-source"    

As a secondary concern, I also need:

  • dump all included javascript source to file (a la firebug)
  • dump pdf/image of page to file (print to file)


HtmlUnit does execute JavaScript. I'm not sure if you can obtain the HTML code after DOM manipulation, but give it a try.

You could write a little Java program that fits your requirements and execute it from the command line, like in your examples.

I haven't tried the code below, just had a look at the JavaDoc:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class DumpPageSource {

    public static void main(String[] args) throws IOException {
        String pageURL = args[0];

        WebClient webClient = new WebClient();
        HtmlPage page = webClient.getPage(pageURL);

        // asXml() serializes the current DOM, i.e. the HTML after JavaScript has run
        // (asText() would only give the visible text of the page)
        String pageContents = page.asXml();

        // Save the resulting page to a file
        Files.write(Paths.get("page-source.html"), pageContents.getBytes());
    }
}
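With HtmlUnit on the classpath, something like this could then be fed your URL list from the shell, e.g. while read url; do java DumpPageSource "$url"; done < urls.txt (you would want to derive the output file name from each URL instead of hard-coding it). I haven't run that either, so treat it as a sketch.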

EDIT :

Selenium (another web testing framework) can take page screenshots, it seems.

Search for selenium.captureScreenshot.
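If it helps, here is a rough, untested sketch using the newer Selenium WebDriver Java bindings rather than the RC-style selenium.captureScreenshot call; the ChromeDriver setup and the output file name are assumptions on my part:

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.openqa.selenium.OutputType;
import org.openqa.selenium.TakesScreenshot;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class PageScreenshot {

    public static void main(String[] args) throws Exception {
        // Assumes the chromedriver binary is installed and on the PATH
        WebDriver driver = new ChromeDriver();
        try {
            driver.get(args[0]);

            // Cast to TakesScreenshot and dump the rendered page to a PNG file
            File shot = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
            Files.copy(shot.toPath(), Paths.get("page.png"));
        } finally {
            driver.quit();
        }
    }
}

WebDriver's getPageSource() should also return the HTML after JavaScript has run, which would cover the main question as well.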


You can use the IRobotSoft web scraper to automate this. The page source is in the UpdatedPage variable; you only need to save that variable to a file.

It also has a CapturePage() function to capture the web page to an image file.
