Getting data from a chart that is displayed on a website
I was asked to draw a graph like this one
using Latex (more precisely, tikz and/or pgf). This would not be a problem if I had the data, but I don't. All I have is the website where the graphs are displayed, and I don't know how to get the data from there.
I spent today trying to get this data, including writing to Google and using software that traces the lines of a graph and infers the data points, such as Datathief and DigitizeIt, but I was unsuccessful. I think the latter did not work because the lines in the graph are too thin and have more than one shade of blue. Of course, I tried to improve the picture quality using Paint and Gimp, but I still couldn't make it work.
I also tried eps2pgf, a Java program that transforms eps figures into pgf code, but even that did not work for the graphs I saved using Image Capture (Mac) and Print Screen (Windows). To be honest, this would be my last option anyway, since it is a "brute force approach" that spits out ugly code you can't really improve on.
After all that I decided to start learning Python, because my supervisor, the person who asked me to draw this picture using tikz, said that there is Python code to get data from websites like this. Now I am not even sure Python will do the job (though I am happy for the excuse to learn it), and of course it takes time to learn a new language and do something like this, so I want to know whether there is really a way to get the data from that website, preferably using Python, but if not, with any other method.
Well, it'd be great if Google provided an API for this data! That said, you can still scrape some data out of the site. Here's how to go about it...
Install Firebug
I prefer Firebug for Firefox, but Chrome's developer tools should also work.
Investigate
First things first, let's visit the url in question and use Firebug to see what's going on. Activate Firebug with F12 or go to Tools->Firebug->Open Firebug. Click on the Net tab first and reload the page. This shows all the requests made and will give you some insight into how the site works. Flash plugins usually load their data externally, as opposed to having it embedded in the actual plugin, and if you look through the requests you'll see one labeled POST service. If you hover over it, Firebug shows the full url, and you'll see the page made a request to http://www.google.com/transparencyreport/traffic/service. You can click on the request and look at the headers sent, the post data, the response and the cookies used to perform the request.
If you look at the response, you'll see what appears to be malformed JSON. From what I can tell, it contains the list of normalized traffic data points. You could actually cut and paste the response out of Firebug, but since this IS a Python question, let's work a bit harder.
Getting the data into Python
To make the post request successfully, we'll need to do (nearly) everything the browser does. We can cheat a bit and just copy the request headers and post data out of firebug, to spoof a real request.
Headers & post data
Use triple quotes to paste multi-line strings into the shell. Copy the request headers and paste them in.
>>> headers = """ <paste headers> """
Next, convert it to a dict for httplib2. I'm going to use a list comprehension (which splits the string on newlines, splits each line on the first ':' and strips the surrounding whitespace, giving a list of two-element lists that dict can convert into a dictionary), but you could do this however you want. You could also create the dict manually; I just find this faster.
>>> headers = dict([[s.strip() for s in line.split(':', 1)]
for line in headers.strip().split('\n')])
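Just to illustrate what that comprehension does, here it is on a couple of made-up header lines (your real headers will come straight out of Firebug):
>>> example = "Host: www.google.com\nUser-Agent: Mozilla/5.0"
>>> example_dict = dict([[s.strip() for s in line.split(':', 1)]
                         for line in example.split('\n')])
>>> example_dict['User-Agent']
'Mozilla/5.0'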
And copy in the post data.
>>> body = """ <paste post data> """
Make the request
I'm going to use httplib2, but there are a few other http clients and some nice tools for scraping the web, like mechanize and scrapy. We'll make the POST request using the url of the API, the headers we copied and the post data we copied from Firebug. The request returns a tuple of the response headers and the content.
>>> import httplib2
>>> h = httplib2.Http()
>>> url = 'http://www.google.com/transparencyreport/traffic/service'
>>> resp, content = h.request(url, 'POST', body=body, headers=headers)
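Before going any further, it's worth checking that the request actually succeeded; httplib2 exposes the HTTP status code on the response object, so if you see something other than 200, double-check the headers and post data you copied:
>>> resp.status
200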
Massage Data
The original format is really weird and only the top bit seems to contain the data points, so I'll ditch the rest.
>>> cleaned = content.split("'")[0][4:-1] + ']'
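To make that string surgery a little less magical, here is the same slicing on a tiny made-up string (this is not the real response, just an illustration of dropping everything after the first single quote, stripping a 4-character prefix and a trailing comma, and closing the list):
>>> fake = "new [44.7,45.4,47.5,'the rest of the payload"
>>> fake.split("'")[0][4:-1] + ']'
'[44.7,45.4,47.5]'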
Now it's valid JSON, so we can deserialize it into native Python data types.
>>> import json
>>> data = json.loads(cleaned)
All of the points I'm interested in are floats, so I'll filter based on that.
>>> data = [x for x in data if type(x) == float]
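To be clear about what this drops: anything that isn't a float (strings, ints, etc.) gets thrown away. A toy example:
>>> [x for x in [44.7, 'some label', 7, 45.4] if type(x) == float]
[44.7, 45.4]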
Process/Save Data
Now that we have our data, inspect it, do additional processing, etc...
>>> data[:5]
[44.73874282836914,
45.4061279296875,
47.5350456237793,
44.56114196777344,
46.08817672729492]
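For instance, a few quick summaries (plain Python, nothing specific to this dataset) will tell you whether the numbers look sane:
>>> len(data)
>>> min(data), max(data)
>>> sum(data) / len(data)   # rough average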
...or just save it.
>>> with open('data.json', 'w') as f:
...: f.write(json.dumps(data))
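...and read it back in later with:
>>> with open('data.json') as f:
...: data = json.loads(f.read())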
We could also plot it out using pyplot from matplotlib (or some other graphing/plotting library).
>>> import matplotlib.pyplot as plt
>>> plt.plot(data)
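Depending on how you're running this, you may need to tell pyplot to actually render the figure, or you can write it straight to a file (the filename here is just an example):
>>> plt.show()                  # pop up an interactive window
>>> plt.savefig('traffic.png')  # ...or save it straight to a file instead
Since the end goal was a tikz/pgf picture, you could also just dump the points to a plain text file and read them into pgfplots instead.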
Conclusion
If you are just interested in a few things, you can adjust the chart to display what you want and then reuse the request headers/post data from the corresponding request to http://www.google.com/transparencyreport/traffic/service. You might want to inspect the actual response more closely than I did; I just discarded the parts that didn't make sense to me. Hopefully they'll expose a public API for this data.