Crawling YouTube user info

I'm trying to crawl YouTube to retrieve information about a group of users (approx. 200 people). I'm interested in looking for relationships between the users.

I've managed to get contact information with the following code:

import gdata.youtube
import gdata.youtube.service
from pub_author import KEY  # developer key kept in a local module


def get_details(name):
    """Return the titles of the entries in a user's contact feed."""
    yt_service = gdata.youtube.service.YouTubeService()
    yt_service.developer_key = KEY
    contact_feed = yt_service.GetYouTubeContactFeed(username=name)
    contacts = [e.title.text for e in contact_feed.entry]
    return contacts

I can't seem to get the other bits of information I need. The reference guide says that I can grab the XML feed from http://gdata.youtube.com/feeds/api/users/username/subscriptions?v=2 (for some arbitrary user). However, if I try to get other users' subscriptions, I get a 403 error with the following message:

User must be logged in to access these subscriptions.
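For reference, the raw request can be reproduced without the gdata library. The following is a minimal sketch, assuming Python 3's standard library rather than the Python 2 gdata client used elsewhere in this post. The names `fetch_subscriptions_feed` and `opener` are hypothetical and exist only so the error path can be exercised without network access. It only shows how the 403 surfaces to the caller; it does not work around it.

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

# Feed URL pattern quoted from the reference guide mentioned above.
FEED_URL = "http://gdata.youtube.com/feeds/api/users/%s/subscriptions?v=2"


def fetch_subscriptions_feed(username, opener=urlopen):
    """Return (status, body) for a user's subscriptions feed.

    `opener` is a hypothetical injection point; it defaults to urlopen.
    """
    url = FEED_URL % quote(username)
    try:
        response = opener(url)
        return response.getcode(), response.read().decode("utf-8", "replace")
    except HTTPError as err:
        # For a private feed the server answers 403 and the body carries
        # the "User must be logged in..." message shown above.
        return err.code, err.read().decode("utf-8", "replace")
```

Returning the status and body instead of raising makes it easy to record which of the 200 users have private subscription feeds.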

If I use the gdata API:

sub_feed = yt_service.GetYouTubeSubscriptionFeed(username=name)
sub = [e.title.text for e in sub_feed.entry]

then I get the same error.

How can I get these subscriptions without logging in? It should be possible, as you can access this information without logging in to the YouTube website.

Also, there seems to be no feed for the subscribers of a particular user. Is this information available through the API?

EDIT

So, it appears this can't be done through the API. I had to do this the quick and dirty way:

for f in `cat users.txt`; do wget "www.youtube.com/profile?user=$f&view=subscriptions" --output-document subscriptions/$f.html; done
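The same download loop can be sketched in Python. This is a rough equivalent, not a tested replacement. It assumes Python 3, that users.txt holds whitespace-separated usernames, and an `http://` scheme (the wget command omits the scheme). The function names `profile_url` and `download_profiles` are made up for this example.

```python
import os
from urllib.parse import urlencode
from urllib.request import urlretrieve


def profile_url(user):
    """Build the subscriptions-page URL used in the wget command."""
    query = urlencode([("user", user), ("view", "subscriptions")])
    # Assumption: plain http, which is what wget uses when no scheme is given.
    return "http://www.youtube.com/profile?" + query


def download_profiles(users_file="users.txt", out_dir="subscriptions"):
    """Save one HTML file per username listed in users_file."""
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    with open(users_file) as handle:
        users = handle.read().split()
    for user in users:
        urlretrieve(profile_url(user), os.path.join(out_dir, user + ".html"))
```

Building the query with `urlencode` keeps usernames containing unusual characters from breaking the URL, which the shell version does not guard against.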

Then I used this script to extract the usernames from the downloaded HTML files:

"""Extract usernames from a Youtube profile using regex"""
import re
def main():
    import sys
    lines = open(sys.argv[1]).read().split('\n')
    #
    # The html files has two <a href="..."> tags for each user: once for an 
    # image thumbnail, and once for a text link.
    # 
    users = set()
    for l in lines:
        match = re.search('<a href="/user/(?P<name>[^"]+)" onmousedown', l)
        if match:
            users.add(match.group('name'))
    users = list(users)
    users.sort()
    print users
if __name__ == '__main__':
    main()
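Since the goal is to find relationships within the group, the extracted names still have to be turned into pairs. Here is one possible sketch, assuming the HTML of each user's subscriptions page has already been read into a dict. `extract_usernames` and `subscription_edges` are hypothetical helper names. Unlike the script above, it scans the whole page with `finditer` instead of matching once per line.

```python
import re

# Same pattern as the script above; it depends on the page markup at the time.
USER_LINK = re.compile(r'<a href="/user/(?P<name>[^"]+)" onmousedown')


def extract_usernames(html):
    """Return the sorted, de-duplicated usernames linked from one page."""
    return sorted(set(m.group("name") for m in USER_LINK.finditer(html)))


def subscription_edges(pages, group):
    """Return (subscriber, channel) pairs where both users are in the group.

    `pages` maps a username to the HTML of that user's subscriptions page.
    """
    group = set(group)
    edges = []
    for user in sorted(pages):
        if user not in group:
            continue
        for channel in extract_usernames(pages[user]):
            # Keep only links inside the group and skip self-references.
            if channel in group and channel != user:
                edges.append((user, channel))
    return edges
```

The resulting list of pairs can be fed into any graph tool to look for clusters among the roughly 200 users.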


To access a user's subscriptions feed without being logged in as that user, the user must have checked the "Subscribe to a channel" checkbox under their Account Sharing settings.

Currently, there is no direct way to get a channel's subscribers through the gdata API. In fact, a feature request for it has remained open for over 3 years. See the request titled "Retrieving a list of a user's subscribers?".
