Relating two consecutive lines in a file
I have a txt file of repeating lines like this:
Host: http://de.wikipedia.org
Referer: http://www.wikipedia.org
Host: answers.yahoo.com/
Referer: http://www.yahoo.com
Host: http://de.wikipedia.org
Referer: http://www.wikipedia.org
Host: http://maps.yahoo.com/
Referer: http://www.yahoo.com
Host: http://pt.wikipedia.org
Referer: http://www.wikipedia.org
Host: answers.yahoo.com/
Referer: http://www.yahoo.com
Host: mail.yahoo.com
Referer: http://www.yahoo.com
Host: http://fr.wikipedia.org
Referer: http://www.wikipedia.org
Host: mail.yahoo.com
Referer: http://www.yahoo.com
I am trying to use this piece of code to go through the lines and count how many hosts have been accessed through the same referrer:
dd = {}
for line in open('hosts.txt'):
    if line.startswith('Host'):
        host = line.split(':')[1].strip('\n')
    elif line.startswith('Referer'):
        referer = line.split(': ')[1].strip('\n')
        dd.setdefault(referer, [0 , host])
        dd[referer][0] += 1
print dd
For example: from wikipedia.org, how many links or domains have been accessed?
I want each referrer to appear only once, and for each referrer I want the number of distinct hosts reached through it. If a host has already been counted for a given referrer, it should be ignored. The result should have the referrer as key and the count of unique hosts as value, as below:
{'http://www.wikipedia.org': 3 , 'www.yahoo.com' : 2}
The problem with my code is that it sums all the repeating hosts for the same referrer because I can't figure out how to relate the Host and Referer lines. So any hint or help is highly appreciated.
You could have a set for each referrer in the dictionary, rather than just a number. This way you could just add each host to the set, and duplicates will automatically be discarded. To get the number of hosts for the referrer, get the number of elements in the set.
dd = {}
referrer = None
for line in open('hosts.txt'):
    if line.startswith('Host'):
        host = line.split(': ')[1].strip('\n')
    elif line.startswith('Referer'):
        referrer = line.split(': ')[1].strip('\n')
    if referrer is not None:
        dd.setdefault(referrer, set()).add(host)
        referrer = None
for k, v in dd.iteritems():
    print k, len(v)
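For anyone on Python 3, the same set-per-referrer idea might be sketched as below. This is only an illustration, not the code from the answer above: the function name count_unique_hosts, the use of collections.defaultdict, and the short sample list are my own assumptions, and it takes any iterable of lines so it can be tried without a hosts.txt file.

```python
from collections import defaultdict


def count_unique_hosts(lines):
    """Map each referrer to the number of distinct hosts seen just before it.

    Hypothetical helper: assumes each 'Host: ...' line is followed by the
    'Referer: ...' line it belongs to, as in the sample file.
    """
    hosts_by_referrer = defaultdict(set)
    host = None
    for line in lines:
        # Split on the first ': ' only, so 'http://...' values stay intact.
        key, sep, value = line.strip().partition(': ')
        if not sep:
            continue  # skip blank or malformed lines
        if key == 'Host':
            host = value
        elif key == 'Referer' and host is not None:
            # A set silently drops hosts already recorded for this referrer.
            hosts_by_referrer[value].add(host)
            host = None
    return {ref: len(hosts) for ref, hosts in hosts_by_referrer.items()}


# A few lines taken from the sample in the question.
sample = [
    'Host: http://de.wikipedia.org',
    'Referer: http://www.wikipedia.org',
    'Host: answers.yahoo.com/',
    'Referer: http://www.yahoo.com',
    'Host: http://de.wikipedia.org',
    'Referer: http://www.wikipedia.org',
    'Host: http://maps.yahoo.com/',
    'Referer: http://www.yahoo.com',
]
counts = count_unique_hosts(sample)
print(counts)
# {'http://www.wikipedia.org': 1, 'http://www.yahoo.com': 2}
```

To run it against the real file, something like `with open('hosts.txt') as f: print(count_unique_hosts(f))` should work, since a file object is also an iterable of lines.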