
How can I scrape data from the Israeli Bureau of Statistics web query tool?

The following url:

http://www.cbs.gov.il/ts/ID40d250e0710c2f/databank/series_func_e_v1.html?level_1=31&level_2=1&level_3=7

points to a data generator of official Israeli statistics that limits extraction to a maximum of 50 series at a time. Is it possible (and if so, how) to write a web scraper (in your favorite language/software) that can follow the clicks at each step, so as to retrieve all of the series under a specific topic?

Thanks.


Take a look at WWW::Mechanize and WWW::HtmlUnit.

#!/usr/bin/perl

use strict;
use warnings;

use WWW::Mechanize;

my $m = WWW::Mechanize->new;

#get page
$m->get("http://www.cbs.gov.il/ts/ID40d250e0710c2f/databank/series_func_e_v1.html?level_1=31&level_2=1&level_3=7");

#submit the form on the first page
$m->submit_form(
    with_fields => {
        name_tatser => 2, #Orders for export
    }
);

#now that we have the second page, submit the form on it
$m->submit_form(
    with_fields => {
        name_ser => 1576, #Number of companies that answered
    }
);

#and so on...

#printing the source HTML is a good way
#to find out what you need to do next
print $m->content;


To submit the forms, you can use Python's mechanize module:

import mechanize
import pprint
import lxml.etree as ET
import lxml.html as lh
import urllib

browser=mechanize.Browser()
browser.open("http://www.cbs.gov.il/ts/ID40d250e0710c2f/databank/series_func_e_v1.html?level_1=31&level_2=1&level_3=7")
browser.select_form(nr=0)

Here we peek at the options available:

pprint.pprint(browser.form.controls[-2].items)
# [<Item name='1' id=None selected='selected' contents='Volume of orders for the domestic market' value='1' label='Volume of orders for the domestic market'>,
#  <Item name='2' id=None contents='Orders for export' value='2' label='Orders for export'>,
#  <Item name='3' id=None contents='The volume of production' value='3' label='The volume of production'>,
#  <Item name='4' id=None contents='The volume of sales' value='4' label='The volume of sales'>,
#  <Item name='5' id=None contents='Stocks of finished goods' value='5' label='Stocks of finished goods'>,
#  <Item name='6' id=None contents='Access to credit for the company' value='6' label='Access to credit for the company'>,
#  <Item name='7' id=None contents='Change in the number of employees' value='7' label='Change in the number of employees'>]

choices=[item.attrs['value'] for item in browser.form.controls[-2].items]
print(choices)
# ['1', '2', '3', '4', '5', '6', '7']

browser.form['name_tatser']=['2']
browser.submit()

We can repeat this for each of the subsequent forms:

browser.select_form(nr=1)

choices=[item.attrs['value'] for item in browser.form.controls[-2].items]
print(choices)
# ['1576', '1581', '1594', '1595', '1596', '1598', '1597', '1593']

browser.form['name_ser']=['1576']
browser.submit()

browser.select_form(nr=2)

choices=[item.attrs['value'] for item in browser.form.controls[-2].items]
print(choices)
# ['32', '33', '34', '35', '36', '37', '38', '39', '40', '41']

browser.form['data_kind']=['33']
browser.submit()

browser.select_form(nr=3)
browser.form['ybegin']=['2010']
browser.form['mbegin']=['1']
browser.form['yend']=['2011']
browser.form['mend']=['5']
browser.submit()

At this point you have three options:

  1. Parse the data from the HTML source
  2. Download an .xls file
  3. Download an XML file

I don't have any experience parsing .xls in Python, so I passed over this option.

Parsing the HTML is possible with BeautifulSoup or lxml. Perhaps that would have been the shortest solution, but finding the right XPaths for the HTML was not obvious to me, so I went for the XML instead.
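For the record, option 1 needn't even require lxml or BeautifulSoup: the standard library's `html.parser` is enough to pull cell text out of a results table. A toy sketch (the markup below is made up; the real page's structure would need inspecting first):

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the stripped text content of every <td> cell."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())

p = CellCollector()
p.feed('<table><tr><td> 2010-12</td><td>40</td></tr></table>')
print(p.cells)  # ['2010-12', '40']
```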

To download the XML from the cbs.gov.il website, one clicks a button that calls a JavaScript function. Unfortunately, mechanize cannot execute JavaScript. Thankfully, the JavaScript merely assembles a new URL, so pulling out the needed parameters with lxml is easy:

content=browser.response().read()
doc=lh.fromstring(content)
params=dict((elt.attrib['name'],elt.attrib['value']) for elt in doc.xpath('//input'))
params['king_format']=2
url='http://www.cbs.gov.il/ts/databank/data_ts_format_e.xml'
params=urllib.urlencode(dict((p,params[p]) for p in [
    'king_format',
    'tod',
    'time_unit_list',
    'mend',
    'yend',
    'co_code_list',
    'name_tatser_list',
    'ybegin',
    'mbegin',
    'code_list',
    'co_name_tatser_list',
    'level_1',
    'level_2',
    'level_3']))

browser.open(url+'?'+params)
content=browser.response().read()
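(A side note: `urllib.urlencode` above is Python 2; in Python 3 the same call lives in `urllib.parse`. A minimal sketch with made-up parameter values:)

```python
from urllib.parse import urlencode

# Toy values; in the real script these come out of the page's <input> elements.
params = {'king_format': 2, 'ybegin': '2010', 'yend': '2011'}
query = urlencode(params)
url = 'http://www.cbs.gov.il/ts/databank/data_ts_format_e.xml?' + query
print(query)  # king_format=2&ybegin=2010&yend=2011
```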

Now we reach another stumbling block: the XML declares its encoding as iso-8859-8-i, which Python's codec machinery does not recognize. Not knowing what else to do, I simply replaced iso-8859-8-i with iso-8859-8. I don't know what bad side effects this might cause.

# A hack, since I do not know how to deal with iso-8859-8-i
content=content.replace('iso-8859-8-i','iso-8859-8')
doc=ET.fromstring(content)

Once you get this far, parsing the XML is easy:

for series in doc.xpath('/series_ts/Data_Set/Series'):
    print(series.attrib)
    # {'calc_kind': 'Weighted',
    #  'name_ser': 'Number Of Companies That Answered',
    #  'get_time': '2011-06-21',
    #  'name_topic': "Business Tendency Survey - Distributions Of Businesses By Industry, Kind Of Questions And Answers  - Manufacturing - Company'S Experience Over The Past Three Months - Orders For Export",
    #  'time_unit': 'Month',
    #  'code_series': '22978',
    #  'data_kind': '5-10 Employed Persons',
    #  'decimals': '0',
    #  'unit_kind': 'Number'}

    for elt in series.xpath('obs'):
        print(elt.attrib)
        # {'time_period': ' 2010-12', 'value': '40'}
        # {'time_period': ' 2011-01', 'value': '38'}
        # {'time_period': ' 2011-02', 'value': '40'}
        # {'time_period': ' 2011-03', 'value': '36'}
        # {'time_period': ' 2011-04', 'value': '30'}
        # {'time_period': ' 2011-05', 'value': '33'}


You should also take a look at Scrapy, which is a web crawler framework for Python. See 'Scrapy at a glance' for an introduction: http://doc.scrapy.org/intro/overview.html
