How to scrape web pages that have different formats/layouts?
I need to scrape Form 10-K reports (i.e. annual reports of US companies) from the SEC website for a project.
The trouble is, companies do not use the exact same format for filing this data. For example, real estate data for 2 different companies could be displayed as below.
1st company
Property name   State   City      Ownership   Year   Occupancy   Total Area
-------------   -----   -------   ---------   ----   ---------   ----------
ABC Mall        TX      Dallas    Fee         2007   97%         1,347,377
XYZ Plaza       CA      Ontario   Fee         2008   85%         2,252,117
2nd company
Property               % Ownership   % Occupancy   Rent    Square Feet
--------------------   -----------   -----------   -----   -----------
New York City
ABC Plaza              100.0%        89.0%         38.07   2,249,000
123 Stores             100.0%        50.0%         18.00   1,547,000
Washington DC Office
12th street            .......
2001, J Drive          .......
etc.
Likewise, the data layout could be entirely different for other companies.
I would like to know if there are better ways to scrape this type of heterogeneous data than writing complex regex searches.
I have the liberty to use Java, Perl, Python or Groovy for this work.
I'd be inclined to keep a library of meta files, each describing the layout of one page you want to scrape, and to consult the matching file when extracting the data.
That way you don't need complex regex commands, and if a site changes its design you only have to change a single one of your files.
How you create the meta file is up to you, but recording the pertinent class names or tags would be a good start; then describe how to extract the data from each of those tags.
I'm not sure whether there is a tool out there that does all of that.
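As a rough illustration of the meta-file idea, here is a minimal Python sketch using only the standard library. Everything specific in it is an assumption made for the example: the `LAYOUTS` entries, the `table_class` values, the column names and the sample HTML are hypothetical and not taken from real filings. Each descriptor says which table to read and what its columns mean; a single generic parser does the rest. It does not handle nested tables or cells spanning several columns.

```python
from html.parser import HTMLParser

# Hypothetical layout descriptors, one per filer. In practice each entry
# could live in its own JSON "meta file". Class names and columns are made up.
LAYOUTS = {
    "company_a": {
        "table_class": "properties",
        "columns": ["name", "state", "city", "ownership",
                    "year", "occupancy", "area"],
    },
    "company_b": {
        "table_class": "portfolio",
        "columns": ["name", "ownership_pct", "occupancy_pct", "rent", "area"],
    },
}


class TableRows(HTMLParser):
    """Collect the <td> text of every row in tables with a given class."""

    def __init__(self, table_class):
        super().__init__()
        self.table_class = table_class
        self.in_table = False
        self.in_cell = False
        self.row = None
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            self.in_table = self.table_class in classes
        elif self.in_table and tag == "tr":
            self.row = []
        elif self.in_table and tag == "td" and self.row is not None:
            self.in_cell = True
            self.row.append("")

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif tag == "td":
            self.in_cell = False
        elif tag == "tr" and self.in_table and self.row:
            self.rows.append(self.row)
            self.row = None

    def handle_data(self, data):
        if self.in_cell and self.row:
            self.row[-1] += data.strip()


def scrape(html, layout):
    """Return one dict per data row, keyed by the layout's column names."""
    parser = TableRows(layout["table_class"])
    parser.feed(html)
    cols = layout["columns"]
    # Rows with a different cell count (e.g. group headings) are skipped.
    return [dict(zip(cols, row)) for row in parser.rows if len(row) == len(cols)]


# Made-up sample page in the style of the second company's table.
SAMPLE = """
<table class="portfolio">
  <tr><th>Property</th><th>% Ownership</th><th>% Occupancy</th>
      <th>Rent</th><th>Square Feet</th></tr>
  <tr><td>New York City</td></tr>
  <tr><td>ABC Plaza</td><td>100.0%</td><td>89.0%</td>
      <td>38.07</td><td>2,249,000</td></tr>
  <tr><td>123 Stores</td><td>100.0%</td><td>50.0%</td>
      <td>18.00</td><td>1,547,000</td></tr>
</table>
"""

for record in scrape(SAMPLE, LAYOUTS["company_b"]):
    print(record)
```

If a company changes its page, only its entry in `LAYOUTS` needs to be edited; the parsing code stays the same.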
The other, nicer, way might be to contact the owners of these sites and see if they provide a feed, in the form of a web service or something similar, that you can use to get the data. Saves a lot of heartache, I should think.