"Hashtable" for Python Twitter Crawler

As part of the Python Twitter crawler I'm creating, I am attempting to make a "hash-table" of sorts to ensure that I don't crawl any user more than once. It is below. However, I am running into some problems. When I start crawling at the user NYTimesKrugman, I seem to crawl some users more than once. When I start crawling at the user cleversallie (in another, completely independent crawl), I don't crawl any user more than once. Any insight into this behavior would be greatly appreciated!

from BeautifulSoup import BeautifulSoup
import re
import urllib2
import twitter

start_follower = "cleversallie" 
depth = 3

U = list()

api = twitter.Api()

def add_to_U(user):
   U.append(user)

def user_crawled(user):
   L = len(L)
   for x in (0, L):
      a = L[x]
      if a != user:
         return False
      else:
         return True

def turn_to_names(users):
    names = list()
    for u in users:
       x = u.screen_name
       names.append(x)
    return names

def test_users(users):
   new = list()
   for u in users:
      if (user_crawled):
         new.append(u)
   return new

def crawl(follower,in_depth): #main method of sorts
   if in_depth > 0:
      add_to_U(follower)
      users = api.GetFriends(follower)
      names = turn_to_names(users)
      select_users = test_users(names)
      for u in select_users[0:5]:
         crawl(u, in_depth - 1)

crawl(start_follower, depth)
for u in U:
   print u
print("Program done.")

EDIT: Based on your suggestions (thank you all very much!), I have rewritten the code as follows:

import re
import urllib2
import twitter

start_follower = "NYTimesKrugman"
depth = 4

searched = set()

api = twitter.Api()

def crawl(follower, in_depth):
    if in_depth > 0:
        searched.add(follower)
        users = api.GetFriends(follower)
        names = set([str(u.screen_name) for u in users])
        names -= searched
        for name in list(names)[0:5]:
            crawl(name, in_depth-1) 

crawl(start_follower, depth)
for x in searched:
    print x
print "Program is completed."


You have a bug where you set L to len(L), not len(U). You also return False as soon as the first user does not match, rather than only after every user fails to match. In Python, the same function may be written as either of the following:

def user_crawled(user):
  for a in U:
    if a == user:
      return True

  return False

def user_crawled(user):
  return user in U

The test_users function refers to user_crawled as a variable; it never actually calls it. It also does the inverse of what you intend: you want new to be populated with untested users, not tested ones. This is that function with the errors corrected:

def test_users(users):
   new = list()
   for u in users:
      if not user_crawled(u):
         new.append(u)
   return new

Using a generator function, you can simplify the function further (provided you intend to loop over the results):

def test_users(users):
   for u in users:
      if not user_crawled(u):
         yield u

You can also use the filter function:

def test_users(users):
   return filter(lambda u: not user_crawled(u), users)

You're using a list to store users, not a hash-based structure. Python provides sets for when you need a collection that can never hold duplicates and supports fast membership tests. Sets can also be subtracted to remove all the elements of one set from another.
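As a minimal sketch of those set properties (the user names here are made up for illustration, and the single-argument print calls are chosen so the snippet runs on either Python 2 or 3):

```python
# Hypothetical screen names, used only for illustration.
searched = set(["alice", "bob"])

# Adding an existing element is a no-op, so duplicates never accumulate.
searched.add("alice")

# Membership tests on a set are hash lookups rather than linear scans.
already_seen = "bob" in searched

# Set subtraction keeps only the names that have not been searched yet.
friends = set(["bob", "carol", "dave"])
unsearched = friends - searched

print(len(searched))
print(already_seen)
print(sorted(unsearched))
```

Sets are unordered, so sort the result whenever a stable order matters.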

Also, your list (U) is of users, but you are matching it against user names. You need to store just the user name of each added user. You also use u to represent a user at one point in the program and a user name at another; more meaningful variable names would help.

Python's syntactic sugar ends up eliminating the need for all of your helper functions. This is how I would rewrite the entire program:

import twitter

start_follower = "cleversallie" 
MAX_DEPTH = 3

searched = set()

api = twitter.Api()

def crawl(follower, in_depth=MAX_DEPTH):
   if in_depth > 0:
      # follower is already a screen name (a string), so store it directly.
      searched.add(follower)

      users = api.GetFriends(follower)
      names = set([u.screen_name for u in users])

      names -= searched
      for name in list(names)[:5]:
         crawl(name, in_depth - 1)

crawl(start_follower)

print "\n".join(searched)
print("Program done.")


For starters, the code sample you've given just plain doesn't work, but I would guess your problem comes from never actually building a hashtable (dictionary? set?).

You call L = len(L) when I cannot see anywhere else that L is defined. You then have a loop,

for x in (0, L):
      a = L[x]
      if a != user:
         return False
      else:
         return True

which will actually execute at most twice, once with x = 0 and once with x = L, where L is len(L), because (0, L) is a two-element tuple rather than a range. Needless to say, indexing into L would fail, since L is an integer. It never gets that far, though: the if-else returns on the first pass either way, and L is not defined anywhere to begin with.
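The difference between looping over the tuple (0, L) and over range(0, L) can be checked directly. This is a small sketch with an arbitrary length of 4:

```python
L = 4  # arbitrary length chosen for illustration

# A parenthesised pair is a tuple, so the loop sees exactly two values.
tuple_values = [x for x in (0, L)]

# range(0, L) produces every index from 0 up to, but not including, L.
range_values = [x for x in range(0, L)]

print(tuple_values)
print(range_values)
```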

What you are most likely looking for is a set with a check for the user, do some work if they're absent, then add the user. This might look like:

first_user = 'cleversallie'
crawled_users =  {first_user} #set literal

def crawl(user, depth, max_depth):
    friends = get_friends(user)  # look up the current user, not first_user
    for friend in friends:
        if friend not in crawled_users and depth < max_depth:
            crawled_users.add(friend)
            crawl(friend, depth + 1, max_depth)

crawl(first_user, 0, 5)

You can fill in the details of what happens in get_friends. I haven't tested this, so pardon any syntax errors, but it should be a strong start for you.
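To try the idea without touching the Twitter API, here is a sketch that swaps get_friends for a lookup in a small hard-coded graph. The FAKE_FRIENDS dictionary, its user names, the get_friends stub, and the extra order list are all assumptions made for illustration; a real crawler would call the API there instead.

```python
# Hypothetical follow graph standing in for the Twitter API.
FAKE_FRIENDS = {
    "cleversallie": ["alice", "bob"],
    "alice": ["bob", "cleversallie", "carol"],
    "bob": ["alice"],
    "carol": ["dave"],
    "dave": [],
}

def get_friends(user):
    # Stub: a real implementation would ask the API for this user's friends.
    return FAKE_FRIENDS.get(user, [])

def crawl(user, depth, max_depth, crawled_users, order):
    # Visit each friend at most once, stopping at max_depth levels.
    for friend in get_friends(user):
        if friend not in crawled_users and depth < max_depth:
            crawled_users.add(friend)
            order.append(friend)
            crawl(friend, depth + 1, max_depth, crawled_users, order)

first_user = "cleversallie"
crawled_users = set([first_user])
order = [first_user]  # records the visit order so duplicates are easy to spot
crawl(first_user, 0, 5, crawled_users, order)
print(order)
```

Even though this graph contains cycles (alice and bob follow each other), each name shows up in order only once, because the membership check happens before every recursive call.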


Let's start by saying there are lots of errors in this code and a lot of un-Pythonic idioms.

For instance:

def user_crawled(user):
  L = len(U)
  for x in (0, L):
    a = L[x]
    if a != user:
      return False
    else:
      return True

This returns on the first pass through the loop. So you really meant something like the following, adding range() and the ability to check all the users.

def user_crawled(user) :
  L = len(U)
  for x in range(0, L) :
    a = U[x]
    if a == user :
       return True
  return False

Now of course a slightly more Pythonic way would be to skip the range and just iterate over the list.

def user_crawled(user) :
  for a in U :
    if a == user :
      return True
  return False

Which is nice and simple, but in true Python style you would jump on the "in" operator and write:

def user_crawled(user) :
  return user in U

A few more Python thoughts - list comprehensions.

def test_users(users) :
  return [u for u in users if not user_crawled(u)]

Which could also be applied to turn_to_names() - left as an exercise to the reader.
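For reference, one way that exercise might come out. The User namedtuple below is a hypothetical stand-in for the objects api.GetFriends returns, which the question's code reads through their screen_name attribute:

```python
from collections import namedtuple

# Hypothetical stand-in for the user objects returned by the Twitter library.
User = namedtuple("User", ["screen_name"])

def turn_to_names(users):
    # Collect the screen name of every user in one expression.
    return [u.screen_name for u in users]

names = turn_to_names([User("alice"), User("bob")])
print(names)
```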
