Creating a spam list with a web crawler in python
Hey guys, I'm not trying to do anything malicious here, I just need to do some homework. I'm a fairly new programmer, I'm using python 3.0, and I having difficulty using recursion for problem-solving. I've been stuck on this question for quite a while. Here's the
Assignment:
Write a recursive method spam(url, n) that takes a url of a web page as input and a non-negative integer n, collects all the email address contained in the web page and adds them to a global dictionary variable spam_dict, and then recursively calls itself on every http link contained in the web page.
You will use a dictionary so only one copy of every email address is saved; your dictionary will store (key,value) pairs (email, email). The recursive call should use the parameter n-1 instead of n. If n = 0, you should collect the email addresses but no recursive calls should be made. The parameter n is used to limit the recursion to at most depth n.
You will need to use the solutions of the two above problems; your method spam() will call the methods links2() and emails() and possibly other functions as well.
Notes:
- running spam() directly will produce no output on the screen; to find your spam_dict, you will need to read the value of spam_dict, and you will also need to reset it to the empty dictionary before every run of spam.
- Recall how global variables are used.
Usage:
>>> spam_dict = {}
>>> spam('http://reed.cs.depaul.edu/lperkovic/csc242/test1.html',0)
>>> spam_dict.keys()
dict_keys([])
>>> spam_dict = {}
>>> spam('http://reed.cs.depaul.edu/lperkovic/csc242/test1.html',1)
>>> spam_dict.keys()
dict_keys(['lperkovic@cs.depaul.edu', 'nobody@xyz.com'])
So far, I've written a function that traverses web pages and puts all the links in a nice little list, and what I wanted to do was call that functions. And why would I use recursion on a dictionary? And how? I don't understand how n ties into all of this.
def links2(url):
content = str(urlopen(url).read())
myparser = MyHTMLParser()
myparser.feed(content)
lst = myparser.get()
mergelst = []
for link in lst:
mergelst.append(urljoin(lst[0],link))
print(mergelst)
Any input (except why spam is bad) would be greatly appreciated. Also, I realize that the above function could probably look better, if you have a way to do i开发者_Go百科t, I'm all ears. However, all I need is the point is for the program to produce the proper output.
Added:
I wrote a function that collects emails from a page, but I'm not sure how to lump .com and .edu and .org all together.
from re import findall
def emails(url):
links = str(links3(url))
# how do I construct pattern?
pattern='[A-Za-z0-9_.]+\@[A-Za-z0-9_.]+.com\.edu\.org
lst = findall(pattern,links)
print(lst)
How do I tell python that? I can't find it in the documentation.
Think about how recursion works. What you want is for your function to be able to call itself in some cases. In this case, you need to add a parameter for the recursion level to your function, and then you need to figure out what it should do in the various cases?
At the most basic level, what should it do with n=0? (hint: you've about got it already)
What should it do if n=1? You probably want to call your function again on each element of your existing list with n=0.
What about if n is greater than 1? You want to call your function again with n = n-1 on each element you've got so far.
n
would play into it, as the problem states, by limiting the recursion to a maximum "call depth".
The idea is that since you're recursively invoking the scanning for emails from an already-running scan, you build up a call stack of what called what that gets deeper and deeper as you continue to recursively call the scanner.
You don't want it to go on forever, so as one of the arguments you pass an integer that you decrement each time you make a call. When it reaches 0, you stop doing recursive calls and let the sequence of recursions unwind itself.
call 1 (args...., n=3)
call 2a (args...., n=2)
call 3 (args...., n=1)
call 4a (args..., n=0) <-- these calls won't call more scans
call 4b (args..., n=0) <-- because n=0, so this is max depth
call 2b (args...., n=2)
精彩评论