downloading morningstar webpages for screenscraping

2023-03-20 07:29 Q&A author:

I'd like to be able to screenscrape Morningstar webpages. Morningstar provides information about a mutual fund that I routinely look up but haven't been able to find elsewhere, namely:

total return compared against benchmark
total return compared against peers
percentile ranking

Here's an example: morningstar example

As a prelude to screenscraping, I need to be able to download the webpage with the desired content. Unfortunately, when I try using Java SE6 or wget to retrieve the above example link, I only get a portion of the HTML (the tables displaying the total return figures are absent). I get the same result if I use my browser (Chrome) to save the page as HTML only. I notice that if I use my browser to save the complete page (HTML, JS, CSS, and everything else), the downloaded HTML does contain the interesting information.

I have two questions:

How can I programmatically download the entire HTML file? Though I'm writing this program in Java, I don't mind invoking an external tool.
Why were my aforementioned attempts not yielding the HTML that I was expecting?

Thanks.

As a side note, I looked at Yahoo Finance and YQL/datatables as alternatives, but Yahoo Finance doesn't provide percentile rankings. If you look up the performance of a mutual fund, you'll see N/A values for the rankings (Yahoo Finance example). Unfortunately, this precludes using YQL/datatables.

Regarding any questions of Morningstar's copyright, I'm screenscraping for personal, non-commercial use, which their copyright notice allows in the last sentence of the second paragraph:

You are entitled to use the Information it contains for your private, non-commercial use only. Morningstar Copyright.

To download the Morningstar webpage, I needed a tool that would download and execute the JavaScript code associated with the webpage. Many such tools, for different programming languages and browsers, are mentioned on StackOverflow. Here are the ones that I wound up using:

HtmlUnit - a GUI-less browser for Java programs
htmlunitscripter - a Firefox add-on that autogenerates HtmlUnit code

The page makes extensive use of XMLHttpRequest to populate its data, which means that your scraper will have to evaluate JavaScript. A plain download with wget or Java only retrieves the initial HTML, before those requests have run, which is why the tables were missing. If you use the developer tools in Chrome, you can see the HTML used to construct the page and the JSON data used to build the tables.
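Since the tables are filled in from separate XMLHttpRequest calls, one option worth trying is to request such a data URL directly instead of rendering the whole page. The following is only a rough sketch under several assumptions: it needs Java 11 or later (for java.net.http, so it will not run on Java SE6), the class name XhrFetch is made up for illustration, the endpoint URL has to be copied by hand from the Network tab of the developer tools, and the X-Requested-With header is a guess at what the server might expect rather than something confirmed for Morningstar.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class XhrFetch {

    // Builds a GET request for a data endpoint found in the browser's Network tab.
    public static HttpRequest buildRequest(String endpoint) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .timeout(Duration.ofSeconds(20))
                .header("Accept", "application/json")
                // Assumption: some servers only answer requests that look like XHR calls.
                .header("X-Requested-With", "XMLHttpRequest")
                .GET()
                .build();
    }

    // Sends the request and returns the raw response body (JSON or an HTML fragment).
    public static String fetch(String endpoint) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(endpoint), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Pass the endpoint URL copied from the developer tools as the first argument.
        System.out.println(fetch(args[0]));
    }
}
```

If the endpoint also depends on cookies or tokens set by the main page, this shortcut will probably not work, and evaluating the JavaScript in a real or simulated browser is the safer route.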

For scraping this, I would try Internet Explorer, since it can host the whole page and perform JavaScript evaluation. There are probably other ways, using APIs such as WebKit, but IE should work right out of the box.
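As one example of the "other ways", and not what this answer itself proposes, a Java program can shell out to a browser that runs in headless mode and prints the rendered DOM. This is a minimal sketch, assuming Java 9 or later and a locally installed Chrome or Chromium binary that supports the --headless and --dump-dom flags. The class name HeadlessDump is hypothetical, and the binary path and page URL are supplied by the caller.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

class HeadlessDump {

    // Builds the command line; --headless and --dump-dom are Chrome/Chromium flags.
    public static List<String> buildCommand(String browserBinary, String url) {
        return List.of(browserBinary, "--headless", "--disable-gpu", "--dump-dom", url);
    }

    // Runs the browser and returns the DOM it prints once the page has loaded.
    public static String dumpDom(String browserBinary, String url)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(browserBinary, url))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // drop browser log noise
                .start();
        // Read stdout fully before waiting, so the child never blocks on a full pipe.
        String html = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("Browser exited with status " + exit);
        }
        return html;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the browser binary, args[1]: page URL.
        System.out.println(dumpDom(args[0], args[1]));
    }
}
```

Whether the dumped DOM already contains the tables depends on when the page finishes its XMLHttpRequest calls, so the output should be checked against what the browser shows before relying on it.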

Have you tried irobot at http://irobotsoft.com? You can check whether it works for this page as follows:

Go to the url
Mark the data of interest
Add a take data action
Test the action and see if it extracts the data you want

They have a forum where you can ask general screenscraping questions.
