Crawling youtube user info
I'm trying to crawl Youtube to retrieve information about a group of users (approx. 200 people). I'm interested in looking for relationships between the users:
- contacts
- subscribers
- subsc开发者_StackOverflowriptions
- what videos they commented on
- etc
I've managed to get contact information with the following source:
import gdata.youtube
import gdata.youtube.service
from gdata.service import RequestError
from pub_author import KEY, NAME_REGEX
def get_details(name):
yt_service = gdata.youtube.service.YouTubeService()
yt_service.developer_key = KEY
contact_feed = yt_service.GetYouTubeContactFeed(username=name)
contacts = [ e.title.text for e in contact_feed.entry ]
return contacts
I can't seem the get the other bits of information I need. The reference guide says that I can grab the XML feed from http://gdata.youtube.com/feeds/api/users/username/subscriptions?v=2 (for some arbitrary user). However, if I try to get other users' subscriptions, I get the a 403 error with the following message:
User must be logged in to access these subscriptions.
If I use the gdata API:
sub_feed = yt_service.GetYouTubeSubscriptionFeed(username=name)
sub = [ e.title.text for e in contact_feed.entry ]
then I get the same error.
How can I get these subscriptions without logging in? It should be possible, as you can access this information without logging in to the Youtube web-site.
Also, there seems to be no feed for the subscribers of particular user. Is this information available through the API?
EDIT
So, it appears this can't be done through the API. I had to do this the quick and dirty way:
for f in `cat users.txt`; do wget "www.youtube.com/profile?user=$f&view=subscriptions" --output-document subscriptions/$f.html; done
Then use this script to get out the usernames from the downloaded HTML files:
"""Extract usernames from a Youtube profile using regex"""
import re
def main():
import sys
lines = open(sys.argv[1]).read().split('\n')
#
# The html files has two <a href="..."> tags for each user: once for an
# image thumbnail, and once for a text link.
#
users = set()
for l in lines:
match = re.search('<a href="/user/(?P<name>[^"]+)" onmousedown', l)
if match:
users.add(match.group('name'))
users = list(users)
users.sort()
print users
if __name__ == '__main__':
main()
In order to access a user's subscriptions feed without the user being logged in, the user must check the "Subscribe to a channel" checkbox under his Account Sharing settings.
Currently, there is no direct way to get a channel's subscribers through the gdata
API. In fact, there has been an outstanding feature request for it that has remained open for over 3 years! See Retrieving a list of a user's subscribers?.
精彩评论