
How can I scrape data from the Israeli Bureau of Statistics web query tool?

The following url:

http://www.cbs.gov.il/ts/ID40d250e0710c2f/databank/series_func_e_v1.html?level_1=31&level_2=1&level_3=7

points to a data generator of official Israeli statistics that limits extraction to a maximum of 50 series at a time. Is it possible (and if so, how) to write a web scraper (in your favorite language/software) that can follow the clicks at each step, so as to retrieve all of the series under a specific topic?

Thanks.


Take a look at WWW::Mechanize and WWW::HtmlUnit.

#!/usr/bin/perl

use strict;
use warnings;

use WWW::Mechanize;

my $m = WWW::Mechanize->new;

#get page
$m->get("http://www.cbs.gov.il/ts/ID40d250e0710c2f/databank/series_func_e_v1.html?level_1=31&level_2=1&level_3=7");

#submit the form on the first page
$m->submit_form(
    with_fields => {
        name_tatser => 2, #Orders for export
    }
);

#now that we have the second page, submit the form on it
$m->submit_form(
    with_fields => {
        name_ser => 1576, #Number of companies that answered
    }
);

#and so on...

#printing the source HTML is a good way
#to find out what you need to do next
print $m->content;


To submit the forms, you can use Python's mechanize module:

import mechanize
import pprint
import lxml.etree as ET
import lxml.html as lh
import urllib

browser=mechanize.Browser()
browser.open("http://www.cbs.gov.il/ts/ID40d250e0710c2f/databank/series_func_e_v1.html?level_1=31&level_2=1&level_3=7")
browser.select_form(nr=0)

Here we peek at the options available:

pprint.pprint(browser.form.controls[-2].items)
# [<Item name='1' id=None selected='selected' contents='Volume of orders for the domestic market' value='1' label='Volume of orders for the domestic market'>,
#  <Item name='2' id=None contents='Orders for export' value='2' label='Orders for export'>,
#  <Item name='3' id=None contents='The volume of production' value='3' label='The volume of production'>,
#  <Item name='4' id=None contents='The volume of sales' value='4' label='The volume of sales'>,
#  <Item name='5' id=None contents='Stocks of finished goods' value='5' label='Stocks of finished goods'>,
#  <Item name='6' id=None contents='Access to credit for the company' value='6' label='Access to credit for the company'>,
#  <Item name='7' id=None contents='Change in the number of employees' value='7' label='Change in the number of employees'>]

choices=[item.attrs['value'] for item in browser.form.controls[-2].items]
print(choices)
# ['1', '2', '3', '4', '5', '6', '7']

browser.form['name_tatser']=['2']
browser.submit()

We can repeat this for each of the subsequent forms:

browser.select_form(nr=1)

choices=[item.attrs['value'] for item in browser.form.controls[-2].items]
print(choices)
# ['1576', '1581', '1594', '1595', '1596', '1598', '1597', '1593']

browser.form['name_ser']=['1576']
browser.submit()

browser.select_form(nr=2)

choices=[item.attrs['value'] for item in browser.form.controls[-2].items]
print(choices)
# ['32', '33', '34', '35', '36', '37', '38', '39', '40', '41']

browser.form['data_kind']=['33']
browser.submit()

browser.select_form(nr=3)
browser.form['ybegin']=['2010']
browser.form['mbegin']=['1']
browser.form['yend']=['2011']
browser.form['mend']=['5']
browser.submit()

At this point you have three options:

  1. Parse the data from the HTML source
  2. Download an .xls file
  3. Download an XML file

I don't have any experience parsing .xls in Python, so I passed over this option.

Parsing the HTML is possible with BeautifulSoup or lxml. Perhaps that would have been the shortest solution, but finding the right XPaths for the HTML was not obvious to me, so I went for the XML instead.
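For the record, option 1 needn't even require lxml or BeautifulSoup: the standard library's `html.parser` is enough to pull cell text out of a results table. A toy sketch (the markup below is made up; the real page's structure would need inspecting first):

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the stripped text content of every <td> cell."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())

p = CellCollector()
p.feed('<table><tr><td> 2010-12</td><td>40</td></tr></table>')
print(p.cells)  # ['2010-12', '40']
```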

To download the XML from the cbs.gov.il website, one clicks a button that calls a JavaScript function. Unfortunately, mechanize cannot execute JavaScript. Thankfully, the JavaScript merely assembles a new URL, so pulling out the needed parameters with lxml is easy:

content=browser.response().read()
doc=lh.fromstring(content)
params=dict((elt.attrib['name'],elt.attrib['value']) for elt in doc.xpath('//input'))
params['king_format']=2
url='http://www.cbs.gov.il/ts/databank/data_ts_format_e.xml'
params=urllib.urlencode(dict((p,params[p]) for p in [
    'king_format',
    'tod',
    'time_unit_list',
    'mend',
    'yend',
    'co_code_list',
    'name_tatser_list',
    'ybegin',
    'mbegin',
    'code_list',
    'co_name_tatser_list',
    'level_1',
    'level_2',
    'level_3']))

browser.open(url+'?'+params)
content=browser.response().read()
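(A side note: `urllib.urlencode` above is Python 2; in Python 3 the same call lives in `urllib.parse`. A minimal sketch with made-up parameter values:)

```python
from urllib.parse import urlencode

# Toy values; in the real script these come out of the page's <input> elements.
params = {'king_format': 2, 'ybegin': '2010', 'yend': '2011'}
query = urlencode(params)
url = 'http://www.cbs.gov.il/ts/databank/data_ts_format_e.xml?' + query
print(query)  # king_format=2&ybegin=2010&yend=2011
```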

Now we reach another stumbling block: the XML declares its encoding as iso-8859-8-i, which Python's codec machinery does not recognize. Not knowing what else to do, I simply replaced iso-8859-8-i with iso-8859-8. I don't know what bad side effects this might cause.

# A hack, since I do not know how to deal with iso-8859-8-i
content=content.replace('iso-8859-8-i','iso-8859-8')
doc=ET.fromstring(content)

Once you get this far, parsing the XML is easy:

for series in doc.xpath('/series_ts/Data_Set/Series'):
    print(series.attrib)
    # {'calc_kind': 'Weighted',
    #  'name_ser': 'Number Of Companies That Answered',
    #  'get_time': '2011-06-21',
    #  'name_topic': "Business Tendency Survey - Distributions Of Businesses By Industry, Kind Of Questions And Answers  - Manufacturing - Company'S Experience Over The Past Three Months - Orders For Export",
    #  'time_unit': 'Month',
    #  'code_series': '22978',
    #  'data_kind': '5-10 Employed Persons',
    #  'decimals': '0',
    #  'unit_kind': 'Number'}

    for elt in series.xpath('obs'):
        print(elt.attrib)
        # {'time_period': ' 2010-12', 'value': '40'}
        # {'time_period': ' 2011-01', 'value': '38'}
        # {'time_period': ' 2011-02', 'value': '40'}
        # {'time_period': ' 2011-03', 'value': '36'}
        # {'time_period': ' 2011-04', 'value': '30'}
        # {'time_period': ' 2011-05', 'value': '33'}


You should also take a look at Scrapy, which is a web crawler framework for Python. See 'Scrapy at a glance' for an introduction: http://doc.scrapy.org/intro/overview.html
