How can I scrape data from the Israeli Bureau of Statistics web query tool?
The following URL:
http://www.cbs.gov.il/ts/ID40d250e0710c2f/databank/series_func_e_v1.html?level_1=31&level_2=1&level_3=7
points to a data generator from the Israeli government that limits extraction to a maximum of 50 series at a time. I wonder: is it possible (and if so, how) to write a web scraper (in your favorite language/software) that can follow the clicks at each step, so as to retrieve all of the series under a specific topic?
Thanks.
Take a look at WWW::Mechanize and WWW::HtmlUnit. Here is a start with WWW::Mechanize:
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $m = WWW::Mechanize->new;

#get the first page
$m->get("http://www.cbs.gov.il/ts/ID40d250e0710c2f/databank/series_func_e_v1.html?level_1=31&level_2=1&level_3=7");

#submit the form on the first page
$m->submit_form(
    with_fields => {
        name_tatser => 2, #Orders for export
    }
);

#now that we have the second page, submit the form on it
$m->submit_form(
    with_fields => {
        name_ser => 1576, #Number of companies that answered
    }
);

#and so on...
#printing the source HTML is a good way
#to find out what you need to do next
print $m->content;
To submit the forms, you can use Python's mechanize module:
import mechanize
import pprint
import lxml.etree as ET
import lxml.html as lh
import urllib
browser=mechanize.Browser()
browser.open("http://www.cbs.gov.il/ts/ID40d250e0710c2f/databank/series_func_e_v1.html?level_1=31&level_2=1&level_3=7")
browser.select_form(nr=0)
Here we peek at the options available:
pprint.pprint(browser.form.controls[-2].items)
# [<Item name='1' id=None selected='selected' contents='Volume of orders for the domestic market' value='1' label='Volume of orders for the domestic market'>,
# <Item name='2' id=None contents='Orders for export' value='2' label='Orders for export'>,
# <Item name='3' id=None contents='The volume of production' value='3' label='The volume of production'>,
# <Item name='4' id=None contents='The volume of sales' value='4' label='The volume of sales'>,
# <Item name='5' id=None contents='Stocks of finished goods' value='5' label='Stocks of finished goods'>,
# <Item name='6' id=None contents='Access to credit for the company' value='6' label='Access to credit for the company'>,
# <Item name='7' id=None contents='Change in the number of employees' value='7' label='Change in the number of employees'>]
choices=[item.attrs['value'] for item in browser.form.controls[-2].items]
print(choices)
# ['1', '2', '3', '4', '5', '6', '7']
browser.form['name_tatser']=['2']
browser.submit()
We can repeat this for each of the subsequent forms:
browser.select_form(nr=1)
choices=[item.attrs['value'] for item in browser.form.controls[-2].items]
print(choices)
# ['1576', '1581', '1594', '1595', '1596', '1598', '1597', '1593']
browser.form['name_ser']=['1576']
browser.submit()
browser.select_form(nr=2)
choices=[item.attrs['value'] for item in browser.form.controls[-2].items]
print(choices)
# ['32', '33', '34', '35', '36', '37', '38', '39', '40', '41']
browser.form['data_kind']=['33']
browser.submit()
browser.select_form(nr=3)
browser.form['ybegin']=['2010']
browser.form['mbegin']=['1']
browser.form['yend']=['2011']
browser.form['mend']=['5']
browser.submit()
At this point you have three options:
- Parse the data from the HTML source
- Download an .xls file
- Download an XML file
I don't have any experience parsing .xls in Python, so I passed over this option.
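If you do want to try the .xls route, the xlrd package reads such files; a minimal sketch, with a hypothetical filename:
import xlrd

#hypothetical filename: whatever the "download .xls" button saved
book=xlrd.open_workbook('cbs_data.xls')
sheet=book.sheet_by_index(0)
for r in range(sheet.nrows):
    print(sheet.row_values(r))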
Parsing the HTML is possible with BeautifulSoup or lxml. Perhaps this would have been the shortest solution, but finding the right XPaths for the HTML was not immediately clear to me, so I went for the XML.
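If you would rather experiment with the HTML route, one way to hunt for the right XPath is to dump a snippet of every table on the results page (a rough sketch):
#print a piece of each table's text to locate the one holding the data,
#then build an XPath from there
doc=lh.fromstring(browser.response().read())
for i,table in enumerate(doc.xpath('//table')):
    print("%d %s" % (i, table.text_content()[:100]))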
To download the XML from the cbs.gov.il website, one clicks a button that calls a JavaScript function. Uh oh -- mechanize cannot execute JavaScript. Thankfully, the JavaScript merely assembles a new URL, and pulling out its parameters with lxml is easy:
content=browser.response().read()
doc=lh.fromstring(content)
#collect the hidden form inputs that the javascript would have submitted
params=dict((elt.attrib['name'],elt.attrib['value']) for elt in doc.xpath('//input'))
params['king_format']=2 #presumably requests the XML format
url='http://www.cbs.gov.il/ts/databank/data_ts_format_e.xml'
params=urllib.urlencode(dict((p,params[p]) for p in [
'king_format',
'tod',
'time_unit_list',
'mend',
'yend',
'co_code_list',
'name_tatser_list',
'ybegin',
'mbegin',
'code_list',
'co_name_tatser_list',
'level_1',
'level_2',
'level_3']))
browser.open(url+'?'+params)
content=browser.response().read()
Now we reach another stumbling block: the XML is encoded in iso-8859-8-i, an encoding Python does not recognize. Not knowing what else to do, I simply replaced iso-8859-8-i with iso-8859-8. As it happens, this is safe: the two encodings share the same character table, and the -i suffix only marks implicit (logical-order) text directionality, which does not affect how the bytes decode.
# Swap in the byte-identical encoding name that the parser recognizes
content=content.replace('iso-8859-8-i','iso-8859-8')
doc=ET.fromstring(content)
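An alternative to the string replacement is to teach Python the encoding yourself: since iso-8859-8-i maps bytes exactly like iso-8859-8, registering an alias codec is safe. (Note that lxml relies on libxml2's own encoding handling, so the replacement above remains the simplest fix for parsing; the alias helps when you decode the bytes in Python directly.)
import codecs

def hebrew_logical(name):
    #map the logical-order variant onto the byte-identical iso-8859-8 codec
    if name.replace('_','-')=='iso-8859-8-i':
        return codecs.lookup('iso-8859-8')
    return None

codecs.register(hebrew_logical)
content.decode('iso-8859-8-i') #now recognized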
Once you get this far, parsing the XML is easy:
for series in doc.xpath('/series_ts/Data_Set/Series'):
    print(series.attrib)
# {'calc_kind': 'Weighted',
# 'name_ser': 'Number Of Companies That Answered',
# 'get_time': '2011-06-21',
# 'name_topic': "Business Tendency Survey - Distributions Of Businesses By Industry, Kind Of Questions And Answers - Manufacturing - Company'S Experience Over The Past Three Months - Orders For Export",
# 'time_unit': 'Month',
# 'code_series': '22978',
# 'data_kind': '5-10 Employed Persons',
# 'decimals': '0',
# 'unit_kind': 'Number'}
for elt in series.xpath('obs'):
    print(elt.attrib)
# {'time_period': ' 2010-12', 'value': '40'}
# {'time_period': ' 2011-01', 'value': '38'}
# {'time_period': ' 2011-02', 'value': '40'}
# {'time_period': ' 2011-03', 'value': '36'}
# {'time_period': ' 2011-04', 'value': '30'}
# {'time_period': ' 2011-05', 'value': '33'}
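Finally, to get all of the series under the topic -- the original question -- you can wrap the steps above in loops over the choices lists. A rough sketch, reusing the browser object and the form/field names observed above:
start=("http://www.cbs.gov.il/ts/ID40d250e0710c2f/databank/"
       "series_func_e_v1.html?level_1=31&level_2=1&level_3=7")
browser.open(start)
browser.select_form(nr=0)
questions=[item.attrs['value'] for item in browser.form.controls[-2].items]
for q in questions:
    #each submit consumes the current page, so restart from the top
    browser.open(start)
    browser.select_form(nr=0)
    browser.form['name_tatser']=[q]
    browser.submit()
    browser.select_form(nr=1)
    series_ids=[item.attrs['value'] for item in browser.form.controls[-2].items]
    for s in series_ids:
        #...replay the submits down to name_ser=s, then data_kind and the
        #date range, and fetch and parse the XML exactly as above...
        pass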
You should also take a look at Scrapy, which is a web crawler framework for Python. See 'Scrapy at a glance' for an introduction: http://doc.scrapy.org/intro/overview.html
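A skeleton spider might look like the following -- a sketch only: it assumes the name_tatser field is rendered as a select element, and the remaining callbacks would mirror the mechanize walkthrough above:
import scrapy

class CbsSpider(scrapy.Spider):
    name='cbs'
    start_urls=['http://www.cbs.gov.il/ts/ID40d250e0710c2f/databank/'
                'series_func_e_v1.html?level_1=31&level_2=1&level_3=7']

    def parse(self, response):
        #submit the first form once per available choice
        for value in response.xpath(
                '//select[@name="name_tatser"]/option/@value').extract():
            yield scrapy.FormRequest.from_response(
                response,
                formdata={'name_tatser': value},
                callback=self.parse_series)

    def parse_series(self, response):
        #...select name_ser, data_kind and the date range here, then
        #request the XML, as in the mechanize walkthrough...
        pass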