re.findall: how to return all matches, including repeated ones
I have a list of IP:PORT entries in HTML. When I use findall to search for the IPs, I get all of them, because the IPs are unique. Some of the ports are the same, though, and I get, for example, a list of 100 IPs and only 87 ports. How can I find all ports, including the repeated ones?
proxies = re.findall(r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}", html)
ports = re.findall(r"[0-9]{1,3},[0-9]{1,3},[0-9]{1,3},[0-9]{1,3}", html)
# ports are encoded to look like this: 47,46,47,46
print(len(proxies))
print(len(ports))
Without seeing the source file, I can only make some basic points.
- Port numbers are not limited to 3 digits, so you are excluding any port over 999
- Do the port numbers only show up as a list of 4 ports? You said the format was a list of IP:PORT, but that is not what you are checking for.
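To illustrate the second point, here is a minimal sketch. The sample string is made up for the example, and the looser pattern is only one possible alternative, not necessarily what the page needs. It shows that re.findall does keep repeated matches, and that a pattern requiring exactly four comma-separated numbers silently skips shorter entries.

```python
import re

# Hypothetical sample: two identical four-number entries and one two-number entry.
sample = "47,46,47,46 47,46,47,46 50,42"

# The pattern from the question requires exactly four comma-separated groups.
four_groups = re.findall(r"[0-9]{1,3},[0-9]{1,3},[0-9]{1,3},[0-9]{1,3}", sample)
print(four_groups)  # ['47,46,47,46', '47,46,47,46']

# A looser pattern (an assumption about the format): one or more groups of digits.
any_groups = re.findall(r"[0-9]+(?:,[0-9]+)*", sample)
print(any_groups)  # ['47,46,47,46', '47,46,47,46', '50,42']
```

The duplicate entry appears twice in both results, so the missing ports are more likely entries the pattern never matched than duplicates being dropped.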
EDIT:
Look at the source of the page more carefully. There are entries that do not have 4 port numbers.
<tr>
<td class="t_ip">151.9.233.6</td>
<td class="t_port">50,42</td>
<td class="t_country"><img src="/images/flags/it.png" alt="it" />Italy</td>
<td class="t_anonymity">
High
</td>
<td class="t_https">-</td>
<td class="t_checked">00:02:16</td>
<td class="t_check">
<a href="" class="a_check" >check</a>
</td>
</tr>
It also seems like it would be a lot easier to check for class="t_ip" and class="t_port" and grab the contents of that element.
<td class="t_ip">(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})</td>
<td class="t_port">((\d,?)+)</td>
Note: The IP address expression will match invalid IP addresses.
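As a rough sketch of how these class-based patterns could be used with re.findall: the second table row below is invented for the example, and the port pattern is a slightly different variant with a non-capturing group. That swap matters because when a pattern has more than one capturing group, re.findall returns tuples rather than strings.

```python
import re

# First row is trimmed from the page source above; the second row is hypothetical
# and reuses the same port value to show that duplicates are kept.
html = """
<tr>
<td class="t_ip">151.9.233.6</td>
<td class="t_port">50,42</td>
</tr>
<tr>
<td class="t_ip">10.0.0.1</td>
<td class="t_port">50,42</td>
</tr>
"""

# One capturing group per pattern, so findall returns plain strings.
ips = re.findall(r'<td class="t_ip">(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})</td>', html)
ports = re.findall(r'<td class="t_port">(\d+(?:,\d+)*)</td>', html)

# Pair each IP with its port field, assuming every row has both cells.
pairs = list(zip(ips, ports))
print(pairs)  # [('151.9.233.6', '50,42'), ('10.0.0.1', '50,42')]
```

Both lists have the same length here even though the port values repeat. Pairing with zip is only safe if every row really contains both cells.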
Not sure that this will help you too much, but just another option:
txt = """
<tr>
<td class="t_ip">151.9.233.6</td>
<td class="t_port">50,42</td>
<td class="t_country"><img src="/images/flags/it.png" alt="it" />Italy</td>
<td class="t_anonymity">
High
</td>
<td class="t_https">-</td>
<td class="t_checked">00:02:16</td>
<td class="t_check">
<a href="" class="a_check" >check</a>
</td>
</tr>
"""
txt = [line.strip() for line in txt.split('\n')]
# Each value sits between startTxt and the closing '</td>' tag (5 characters).
def getVals(startTxt):
    return [line[len(startTxt):-len('</td>')] for line in txt if line.startswith(startTxt)]
print(getVals('<td class="t_ip">'))
print(getVals('<td class="t_port">'))