Getting a steady flow of messages from twitter
I'd like to try to make a simple Twitter client that learns my tastes and automatically finds friends and interesting tweets to provide me with relevant information.
To get started, I would need to get a good stream of random twitter messages, so I can test a few machine learning algorithms on them.
What API methods should I use for this? Do I have to poll regularly to get messages, or is there a way to get twitter to push messages as they are published?
I'd also be interested in learning about any similar project.
I use tweepy to access the Twitter API and listen to the public stream they provide, which should be a one-percent sample of all tweets. Here is the sample code that I use myself. You can still use the basic auth mechanism for streaming, though they may change that soon. Change the USERNAME and PASSWORD variables accordingly and make sure you respect the error codes that Twitter returns (this sample code might not respect the exponential backoff mechanism that Twitter wants in some cases).
import sys
import time

import tweepy

# Fill in your own Twitter credentials.
USERNAME = ''
PASSWORD = ''

def log_error(msg):
    timestamp = time.strftime('%Y%m%d:%H%M:%S')
    sys.stderr.write("%s: %s\n" % (timestamp, msg))

class StreamWatcherListener(tweepy.StreamListener):
    def on_status(self, status):
        print status.text.encode('utf-8')

    def on_error(self, status_code):
        log_error("Status code: %s." % status_code)
        time.sleep(3)
        return True  # keep stream alive

    def on_timeout(self):
        log_error("Timeout.")

def main():
    auth = tweepy.BasicAuthHandler(USERNAME, PASSWORD)
    listener = StreamWatcherListener()
    stream = tweepy.Stream(auth, listener)
    stream.sample()

if __name__ == '__main__':
    while True:  # reconnect after an error; Ctrl-C leaves the loop
        try:
            main()
        except KeyboardInterrupt:
            break
        except Exception as e:
            log_error("Exception: %s" % str(e))
            time.sleep(3)
I also set the timeout of the socket module. I believe I had some problems with the default timeout behavior in Python, so be careful.
import socket
socket.setdefaulttimeout(timeout)  # timeout: a number of seconds that you choose
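As for the exponential backoff mentioned above, here is a minimal sketch of the general idea rather than Twitter's exact policy. The function names, the `connect` callable, and the 5-second start and 320-second cap are illustrative assumptions, so check Twitter's streaming documentation for the values they actually expect.

```python
import time


def backoff_delays(initial=5.0, factor=2.0, maximum=320.0):
    """Yield wait times forever: initial, initial*factor, ... capped at maximum.

    The default numbers are placeholders, not Twitter's official values.
    """
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, maximum)


def run_with_backoff(connect, max_attempts=5, sleep=time.sleep):
    """Call connect() until it returns normally, waiting longer after each failure.

    connect is a hypothetical zero-argument callable that opens the stream
    and raises an exception when the connection fails.
    """
    delays = backoff_delays()
    for attempt in range(max_attempts):
        try:
            return connect()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(next(delays))
```

With tweepy, `connect` could be a small function that builds the stream and calls `stream.sample()`. Passing `sleep` as a parameter makes the retry logic easy to test without actually waiting.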
I don't think you can get access to the world Twitter timeline, but you can certainly look at your friends' tweets and set up lists to play with. I would recommend using the Twitter4J library: http://twitter4j.org/en/index.html
I might have been mistaken; getPublicTimeline() might be what you want.
Twitter has a streaming API for just this purpose. They provide a small random sample of all messages posted to twitter, continually updated in a 'push' manner as you describe. If you are doing this for some kind of noble purpose then you can request access from Twitter to a larger sample.
From the API docs, you want statuses/sample:
statuses/sample
Returns a random sample of all public statuses. The default access level, ‘Spritzer’ provides a small proportion of the Firehose, very roughly, 1% of all public statuses. The “Gardenhose” access level provides a proportion more suitable for data mining and research applications that desire a larger proportion to be statistically significant sample. Currently Gardenhose returns, very roughly, 10% of all public statuses. Note that these proportions are subject to unannounced adjustment as traffic volume varies.
URL: http://stream.twitter.com/1/statuses/sample.json
Method(s): GET
Parameters: count, delimited
Returns: stream of status element
Personally, I've had some success using the python library tweepy to use the streaming API.
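If you want to see what a listener has to deal with, the stream is a sequence of JSON messages, one per line, with blank keep-alive lines and non-tweet messages (such as deletion notices) mixed in. The sketch below filters such lines down to tweet texts. The function name and the sample lines are made up for illustration, and it works on any iterable of strings, so no network access is needed to try it.

```python
import json


def iter_tweet_texts(lines):
    """Yield the 'text' field of every status found in an iterable of raw lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank keep-alive line
        try:
            message = json.loads(line)
        except ValueError:
            continue  # truncated or malformed line
        # Deletion notices and other non-tweet messages have no 'text' key.
        if isinstance(message, dict) and 'text' in message:
            yield message['text']


# Hand-written sample input, not real Twitter data.
sample_lines = [
    '{"text": "hello world", "id": 1}',
    '',
    '{"delete": {"status": {"id": 2}}}',
    '{"text": "second tweet", "id": 3}',
]
texts = list(iter_tweet_texts(sample_lines))
```

The resulting list of texts is what you would feed to your machine learning experiments.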
Tweepy's BasicAuthHandler is deprecated. Here's a new set of code that uses OAuth instead. Have fun!
import json
import sys
import time

import tweepy

# Fill in the consumer key/secret and access token/secret of your Twitter app.
ckey = ''
csecret = ''
atoken = ''
asecret = ''

def log_error(msg):
    timestamp = time.strftime('%Y%m%d:%H%M:%S')
    sys.stderr.write("%s: %s\n" % (timestamp, msg))

class StreamWatcherListener(tweepy.StreamListener):
    def on_data(self, data):
        # data is the raw JSON string of one stream message
        try:
            message = json.loads(data)
        except ValueError:
            return True  # skip a malformed message
        # Some messages are tweet deletion notices and have no 'text' key.
        if 'text' in message:
            print message['text'].encode('utf-8')
        return True  # keep stream alive

    def on_error(self, status_code):
        log_error("Status code: %s." % status_code)
        time.sleep(3)
        return True  # keep stream alive

    def on_timeout(self):
        log_error("Timeout.")

def main():
    auth = tweepy.OAuthHandler(ckey, csecret)
    auth.set_access_token(atoken, asecret)
    listener = StreamWatcherListener()
    stream = tweepy.Stream(auth, listener)
    stream.sample()

if __name__ == '__main__':
    try:
        main()
    except Exception as e:
        log_error("Exception: %s" % str(e))
        time.sleep(3)