How to scrape web pages that are in different formats/layouts?

2022-12-09 10:36

I need to scrape Form 10-K reports (i.e. annual reports of US companies) from the SEC website for a project.

The trouble is, companies do not use the exact same format for filing this data. For example, real estate data for 2 different companies could be displayed as below:

1st company

Property name   State  City     Ownership   Year  Occupancy Total Area
-------------   -----  ------   ---------   ----  --------- ----------
ABC Mall         TX    Dallas   Fee         2007    97%       1,347,377
XYZ Plaza        CA    Ontario  Fee         2008    85%       2,252,117



2nd company

Property          % Ownership  %Occupancy Rent   Square Feet
---------------   -----------  ---------- -----  -----------
New York City
  ABC Plaza       100.0%        89.0%     38.07    2,249,000 
  123 Stores      100.0%        50.0%     18.00    1,547,000 
Washington DC Office
  12th street     .......
  2001, J Drive   .......

etc.

Likewise, the data layout could be entirely different for other companies.

I would like to know if there are better ways to scrape this type of heterogeneous data than writing complex regex searches.

I am free to use Java, Perl, Python or Groovy for this work.

I'd be inclined to keep a library of meta files, one for each page layout you want to scrape, and to consult the matching file when extracting the data.

That way you don't need complex regular expressions, and if a site changes its design you only have to change a single one of your files.
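As a rough illustration of the idea for the plain-text tables shown in the question, here is a minimal Python sketch. The names `LAYOUTS`, `is_ruler` and `parse_table`, the descriptor keys `fields` and `grouped`, and the field names are all hypothetical, chosen only for this example. It assumes each table has a dashed ruler line under the headers and that every cell starts at or after its column's first dash.

```python
import re

# Hypothetical layout descriptors, one per filer. In practice each entry could
# live in its own JSON or YAML file; the keys used here are illustrative only.
LAYOUTS = {
    "company_1": {
        "fields": ["property", "state", "city", "ownership",
                   "year", "occupancy", "area_sqft"],
        "grouped": False,
    },
    "company_2": {
        "fields": ["property", "ownership_pct", "occupancy_pct",
                   "rent", "area_sqft"],
        "grouped": True,  # unindented lines are group headings, not data rows
    },
}


def is_ruler(line):
    """True for a line made only of dashes and spaces, with at least one dash."""
    return "-" in line and set(line) <= {"-", " "}


def parse_table(text, layout):
    """Turn one fixed-width text table into a list of dicts.

    Column boundaries come from the dashed ruler line; the meaning of each
    column comes from the layout descriptor.
    """
    lines = text.splitlines()
    ruler_idx = next(i for i, line in enumerate(lines) if is_ruler(line))
    starts = [m.start() for m in re.finditer(r"-+", lines[ruler_idx])]
    fields = layout["fields"]
    if len(starts) != len(fields):
        raise ValueError("descriptor does not match the table's column count")

    # Each column runs from its own start to the next column's start;
    # the last column runs to the end of the line.
    bounds = list(zip(starts, starts[1:] + [None]))
    rows, group = [], None
    for line in lines[ruler_idx + 1:]:
        if not line.strip():
            continue
        if layout["grouped"] and not line.startswith(" "):
            group = line.strip()  # remember the heading for the rows below it
            continue
        row = {f: line[a:b].strip() for f, (a, b) in zip(fields, bounds)}
        if layout["grouped"]:
            row["group"] = group
        rows.append(row)
    return rows
```

Calling `parse_table(text, LAYOUTS["company_1"])` on the first table would give one dict per property. If a company changes its table, only its own descriptor needs editing; the parsing function stays the same.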

How you create the meta file is up to you, but recording things like the pertinent class names or tags might be a good start, followed by a description of how to extract the data from each such tag.
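For pages that are marked up in HTML, a meta file keyed on tags and class names might look something like the sketch below. It uses only Python's standard `html.parser` module. The `LAYOUT` dict, the `FieldExtractor` class, the `extract` function and the class names `prop-name` and `occ` are assumptions made up for this example, not anything the SEC pages are known to use. It also assumes the targeted tags are not nested inside each other.

```python
from html.parser import HTMLParser

# Hypothetical meta file content for one page layout: which tag and class
# name hold each field. The class names are invented for this example.
LAYOUT = {
    "property": {"tag": "td", "class": "prop-name"},
    "occupancy": {"tag": "td", "class": "occ"},
}


class FieldExtractor(HTMLParser):
    """Collect the text of every element that matches a rule in the layout."""

    def __init__(self, layout):
        super().__init__()
        self.layout = layout
        self.current = None  # name of the field being captured, if any
        self.values = {name: [] for name in layout}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name, rule in self.layout.items():
            if tag == rule["tag"] and rule["class"] in classes:
                self.current = name
                self.values[name].append("")  # start a new captured value
                break

    def handle_endtag(self, tag):
        if self.current and tag == self.layout[self.current]["tag"]:
            self.current = None

    def handle_data(self, data):
        if self.current:
            self.values[self.current][-1] += data


def extract(html, layout):
    """Return {field: [text, ...]} for one page, driven by its layout rules."""
    parser = FieldExtractor(layout)
    parser.feed(html)
    parser.close()
    return {name: [v.strip() for v in vals]
            for name, vals in parser.values.items()}
```

The extraction code never mentions a specific site. Supporting another layout would mean writing another `LAYOUT`-style descriptor rather than another parser.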

I'm not sure whether there is a tool out there that does all of that.

The other, nicer way might be to contact the owners of these sites and ask whether they provide a feed, such as a web service, that you can use to get the data. That would save a lot of heartache, I should think.
