Obtaining loaded images from a given URL via Python
Is there any way to load a URL via Python and then retrieve a list of all of the images that were loaded via that URL? I'm essentially looking to do something similar to TamperData or Fiddler and retrieve a list of all images that a given website loaded.
Interesting task. Here's one way of solving it, along the lines suggested by Jochen Ritzel.
It uses pylibpcap instead of pycap. Personally, I find pycap hard to work with because so little documentation is available. With pylibpcap, you can translate most code directly from the libpcap examples (see, for example, this tutorial for a nice introduction). The man pages for tcpdump and pcap are also great resources.
You may want to look at the standards for Ethernet, IPv4, TCP, and HTTP.
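For orientation, the parser in this answer relies on only a few fixed offsets from those standards: a 14-byte Ethernet header, the IPv4 header length in the low nibble of the first IP byte, the destination port in bytes 2-3 of the TCP header, and the TCP header length in the high nibble of byte 12. The following is a minimal sketch that checks those offsets on a hand-built frame. It assumes Python 3 `bytes`, untagged Ethernet II framing, and IPv4; the name `tcp_payload` and the sample frame are made up for illustration and are not part of pylibpcap.

```python
import struct

ETH_HEADER_LEN = 14  # untagged Ethernet II: 6 + 6 bytes of MACs, 2 bytes EtherType


def tcp_payload(frame):
    """Return (dest_port, payload) for an Ethernet/IPv4/TCP frame given as bytes."""
    ip = frame[ETH_HEADER_LEN:]
    ip_header_len = 4 * (ip[0] & 0x0F)            # IHL: low nibble, in 32-bit words
    tcp = ip[ip_header_len:]
    dest_port = struct.unpack('!H', tcp[2:4])[0]  # bytes 2-3 of the TCP header
    tcp_header_len = 4 * (tcp[12] >> 4)           # data offset: high nibble of byte 12
    return dest_port, tcp[tcp_header_len:]


# Hand-built sample frame (illustrative values only, checksums left at zero).
eth = b'\x00' * 12 + b'\x08\x00'                  # zeroed MACs, EtherType 0x0800 (IPv4)
ip = bytes([0x45]) + b'\x00' * 19                 # version 4, IHL 5 -> 20-byte header
tcp = struct.pack('!HHIIBBHHH',
                  54321, 80,                      # source port, destination port
                  0, 0,                           # sequence and acknowledgement numbers
                  5 << 4, 0x18,                   # data offset 5 words, flags PSH+ACK
                  0, 0, 0)                        # window, checksum, urgent pointer
frame = eth + ip + tcp + b'GET / HTTP/1.1\r\nHost: example.com\r\n\r\n'

port, payload = tcp_payload(frame)
print(port, payload[:14])
```

Frames with VLAN tags or IPv6 headers would need different offsets, which is why the standards are worth a look before adapting the code.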
Note 1: The code below only prints out the HTTP GET requests. Picking out the image requests and downloading them with the urllib module should pose no problem.
Note 2: This code works on Linux; I'm not sure what device names you need to use on Windows or macOS. You'll also need root privileges.
    #!/usr/bin/env python
    # Written for Python 2 (packet data arrives as str) and pylibpcap.
    import pcap
    import struct


    def parse_packet(data):
        """
        Parse an Ethernet/IP/TCP packet and return the TCP payload,
        or None if the packet is not addressed to port 80.
        """
        # See the Ethernet, IP, and TCP standards for details.
        data = data[14:]  # Strip Ethernet header
        header_length = 4 * (ord(data[0]) & 0x0f)  # IP header length in bytes
        data = data[header_length:]  # Strip IP header
        dest_port = struct.unpack('!H', data[2:4])[0]
        if dest_port != 80:  # Keep only outgoing packets (sent to port 80)
            return None
        header_length = 4 * ((ord(data[12]) & 0xf0) >> 4)  # TCP header length in bytes
        data = data[header_length:]  # Strip TCP header
        return data


    def parse_get(data):
        """
        Parse an HTTP GET request, returning the host plus the request URI.
        """
        if data is None or not data.startswith('GET'):
            return None
        fields = data.split('\n')
        uri = fields[0].split()[1]
        for field in fields[1:]:
            if field.lower().startswith('host:'):
                return field[5:].strip() + uri
        return None


    def packet_handler(length, data, timestamp):
        uri = parse_get(parse_packet(data))
        if uri is not None:
            print(uri)


    # Set up pcap sniffer
    INTERFACE = 'wlan0'
    FILTER = 'tcp port 80'

    p = pcap.pcapObject()
    p.open_live(INTERFACE, 1600, 0, 100)  # snaplen 1600, non-promiscuous, 100 ms timeout
    p.setfilter(FILTER, 0, 0)

    try:
        while True:
            p.dispatch(1, packet_handler)  # Handle one packet per call
    except KeyboardInterrupt:
        pass
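As a rough sketch of the filtering step mentioned in Note 1, the `host/path` strings printed by the sniffer could be narrowed down to likely images by file extension. This is written for Python 3 and is only a heuristic: the helper names (`is_image_uri`, `image_urls`, `download`) and the extension list are assumptions for illustration, and images served from URLs without an extension would be missed, since the sniffer only sees requests and not the `Content-Type` of the responses.

```python
import posixpath
from urllib.parse import urlsplit
from urllib.request import urlretrieve

# Assumed list of extensions; extend as needed.
IMAGE_EXTENSIONS = {'.png', '.jpg', '.jpeg', '.gif', '.bmp', '.webp', '.svg', '.ico'}


def is_image_uri(uri):
    """Guess from the path extension whether a 'host/path' string names an image."""
    path = urlsplit('http://' + uri).path       # drops any query string or fragment
    extension = posixpath.splitext(path)[1].lower()
    return extension in IMAGE_EXTENSIONS


def image_urls(uris):
    """Return full http:// URLs for the entries that look like images."""
    return ['http://' + uri for uri in uris if is_image_uri(uri)]


def download(url, directory='.'):
    """Fetch one URL into `directory`, naming the file after the last path segment."""
    name = posixpath.basename(urlsplit(url).path) or 'unnamed'
    return urlretrieve(url, posixpath.join(directory, name))[0]


captured = [
    'example.com/logo.png?v=2',
    'example.com/index.html',
    'example.com/a/b/PHOTO.JPG',
]
print(image_urls(captured))
```

Calling `download(url)` on each returned URL would then fetch the files; that step is left out of the example run because it needs network access.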