What's a good way to select a random set of twitterers?
Considering the set of Twitter users as the "nodes" and the relation "u follows v"
as the "edges", we have a graph from which I would like to select a subset of the users at random. I could be wrong, but from reading the API docs I think it's impossible to get a collection of users except by getting the followers or friends of an already-known user.
So, starting from myself and exploring the Twitter graph from there, what's a good way to select a random sample of (say 100) users?
I would use the numerical user ID. Generate a bunch of random numbers and fetch users based on those. If you hit a nonexistent ID, simply skip it.
The Twitter API wiki, for users/show:
id. The ID or screen name of a user.
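A minimal sketch of that idea, in case it helps. Everything here is an assumption for illustration: `fetch_user` is a hypothetical callable you would wrap around users/show (returning None for a nonexistent ID), and `max_id` is your own guess at the largest ID worth trying.

```python
import random

def sample_users_by_id(fetch_user, k, max_id, max_attempts=10000, rng=None):
    """Draw random numeric IDs and keep the ones that resolve to a user.

    fetch_user is a hypothetical lookup: it takes an ID and returns a user
    object, or None when no user has that ID.
    """
    rng = rng or random.Random()
    found = {}
    attempts = 0
    while len(found) < k and attempts < max_attempts:
        attempts += 1
        candidate = rng.randint(1, max_id)
        if candidate in found:
            continue  # already sampled this user, draw again
        user = fetch_user(candidate)
        if user is not None:  # nonexistent IDs are simply skipped
            found[candidate] = user
    return list(found.values())
```

The `max_attempts` cap is there because each attempt would be one API call, and a sparse ID space could otherwise burn through a rate limit before finding `k` users.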
Twitter's streaming API has an endpoint called "Sample", which "returns a small random sample of all public statuses"
(cf. https://dev.twitter.com/docs/api/1.1/get/statuses/sample).
The authors' Twitter IDs are returned with the tweets, so this would get you random active Twitter users.
You can use GET statuses/sample to get a continuous stream of the tweets being posted on Twitter while your code is executing. You can then extract the user (tweeter) from the tweet information received.
Here is Python code that does so using the python-twitter package:
import twitter

# The "account" file should contain, separated by whitespace:
# consumer_key consumer_secret access_token_key access_token_secret
with open("account", "r") as f:
    acc = f.read().split()

api = twitter.Api(consumer_key=acc[0], consumer_secret=acc[1],
                  access_token_key=acc[2], access_token_secret=acc[3])

stream = api.GetStreamSample()
cnt = 0
userIDs = []
for tweet in stream:
    # Stop after getting 100 tweets. You can adjust this to any number.
    if cnt == 100:
        break
    # The stream may also carry non-tweet messages (e.g. deletion notices)
    # that have no 'user' field, so skip those.
    if 'user' not in tweet:
        continue
    cnt += 1
    userIDs.append(tweet['user']['id'])

userIDs = list(set(userIDs))  # Remove any duplicated user IDs
print(userIDs)
Assuming six degrees of separation holds, you could do a breadth-first search up to 6 levels deep and select 100 random users from the resulting list. Or you could stop looking for more users once you have, say, a million unique users, and sample 100 from those.
Since storing a list of a million users and then sampling from it might be prohibitive, you can use a technique called reservoir sampling, which lets you sample during the traversal itself.
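Here is a rough sketch of how the traversal and the reservoir could fit together. It assumes a hypothetical `get_neighbors(user)` callable that returns the followers or friends of a user (you would back it with the API), and it caps the crawl at `max_visited` users rather than at a fixed depth.

```python
import random
from collections import deque

def reservoir_sample_bfs(start, get_neighbors, k, max_visited, rng=None):
    """Breadth-first crawl from start, keeping a uniform sample of k visited users.

    get_neighbors is a hypothetical callable returning the users adjacent
    to a given user (e.g. their followers or friends).
    """
    rng = rng or random.Random()
    reservoir = []
    seen = {start}
    queue = deque([start])
    visited = 0
    while queue and visited < max_visited:
        user = queue.popleft()
        visited += 1
        if len(reservoir) < k:
            # The first k users fill the reservoir directly.
            reservoir.append(user)
        else:
            # Afterwards, the i-th user replaces a random slot with probability k/i.
            j = rng.randrange(visited)
            if j < k:
                reservoir[j] = user
        for neighbor in get_neighbors(user):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return reservoir
```

Note that the `seen` set still grows with the crawl, because the search needs it to avoid revisiting users. What the reservoir saves you is keeping a separate candidate list and sampling from it afterwards. The result is uniform over the users the crawl visited, not over all of Twitter.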
Just query the public timeline, and use the set of users returned:
http://apiwiki.twitter.com/Twitter-REST-API-Method%3A-statuses-public_timeline
It won't be random, since it's just the last 20 tweets sent by anyone, but it will most likely never be the same set of users twice.
Since it only gives you 20 at a time, and the results are cached on their servers for 60 seconds, you'll have to do 5 different requests with a 60 second pause in between them.
Of course, it's also possible that some users will be tweeting frequently in a given time period, so you might get fewer than 100 users in total from those 5 requests. If you need exactly 100, just keep looping until you've collected that many.
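The loop could look something like the sketch below. The `fetch_timeline_user_ids` callable is hypothetical: it stands for one request to the public timeline that returns the user IDs of the tweets in the response. The `sleep` parameter is only there so the pause can be swapped out when trying the code without waiting.

```python
import time

def collect_unique_users(fetch_timeline_user_ids, target=100, pause_seconds=60,
                         sleep=time.sleep, max_requests=50):
    """Poll until at least `target` distinct user IDs have been seen.

    fetch_timeline_user_ids is a hypothetical callable that performs one
    request and returns an iterable of user IDs.
    """
    users = set()
    requests_made = 0
    while len(users) < target and requests_made < max_requests:
        if requests_made > 0:
            # Wait out the server-side cache before asking again.
            sleep(pause_seconds)
        users.update(fetch_timeline_user_ids())
        requests_made += 1
    return users
```

The returned set can be slightly larger than `target`, since the last response is added whole; trim it if you need exactly 100.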
Unless you have the entire Twitter user graph (or a random sample of it), you won't be able to take a truly random sample. Without it, any sample you take will be biased by its relationship to you.
You may use this repo, [Random Twitter Handles Generator], to generate random Twitter handles (usernames) for a specific country.
Random handles are generated based on:
- country name
- specified number of random coordinate points in that country
- radius around each latitude/longitude coordinate point, in km (tweets will be within that radius)
- specified number of tweets to get per coordinate point
- language of the tweets