Using Python/Selenium/Best Tool For The Job to get URI of image requests generated through JavaScript?

2022-12-10 21:38 问答作者：

I have some JavaScript from a 3rd party vendor that is initiating an image request. I would like to figure out the URI of this image request.

I can load the page in my browser, and then monitor "Li开发者_JS百科ve HTTP Headers" or "Tamper Data" in order to figure out the image request URI, but I would prefer to create a command line process to do this.

My intuition is that it might be possible using python + qtwebkit, but perhaps there is a better way.

To clarify: I might have this (overly simplified code).

<script>
suffix = magicNumberFunctionIDontHaveAccessTo();
url = "http://foobar.com/function?parameter=" + suffix
img = document.createElement('img'); img.src=url; document.all.body.appendChild(img);
</script>

Then once the page is loaded, I can go figure out the url by sniffing the packets. But I can't just figure it out from the source, because I can't predict the outcome of magicNumberFunction...().

Any help would be muchly appreciated!

Thank you.

The simplest thing to do might be to use something like HtmlUnit and skip a real browser entirely. By using Rhino, it can evaluate JavaScript and likely be used to extract that URL out.

That said, if you can't get that working, try out Selenium RC and use the captureNetworkTraffic command (which requires the Selenium instant be started with an option of captureNetworkTraffic=true). This will launch Firefox with a proxy configured and then let you pull the request info back out as JSON/XML/plain text. Then you can parse that content and get what you want.

Try out the instant test tool that my company offers. If the data you're looking for is in our results (after you click View Details), you'll be able to get it from Selenium. I know, since I wrote the captureNetworkTraffic API for Selenium for my company, BrowserMob.

I would pick any one of the many http proxy servers written in Python -- probably one of the simplest ones at the very top of the list -- and tweak it to record all URLs requested (as well as proxy-serve them) e.g. appending them to a text file -- without loss of generality, call that text file 'XXX.txt'.

Now all you need is a script that: starts the proxy server in question; starts Firefox (or whatever) on your main desired URL with the proxy in question set as your proxy (see e.g. this SO question for how), though I'm sure other browsers would work just as well; waits a bit (e.g. until the proxy's XXX.txt file has not been altered for more than N seconds); reads XXX.txt to extract only the URLs you care about and record them wherever you wish; turns down the proxy and Firefox processes.

I think this will be much faster to put in place and make work correctly, for your specific requirements, than any more general solution based on qtwebkit, selenium, or other "automation kits".

Use Firebug Firefox plugin. It will show you all requests in real time and you can even debug the JS in your Browser or run it step-by-step.

Ultimately, I did it in python, using Selenium-RC. This solution requires the python files for selenium-rc, and you need to start the java server ("java -jar selenium-server.jar")

from selenium import selenium
import unittest
import lxml.html

class TestMyDomain(unittest.TestCase):
    def setUp(self):
        self.selenium = selenium("localhost", \
            4444, "*firefox", "http://www.MyDomain.com")
        self.selenium.start()

    def test_mydomain(self):

        htmldoc = open('site-list.html').read()
        url_list = [link for (element, attribute,link,pos) in lxml.html.iterlinks(htmldoc)]
        for url in url_list:

            try: 
                sel = self.selenium
                sel.open(url)        
                sel.select_window("null")
                js_code = '''
                myDomainWindow = this.browserbot.getUserWindow();
                for(obj in myDomainWindow) {  

                   /* This code grabs the OMNITURE tracking pixel img */
                    if ((obj.substring(0,4) == 's_i_') && (myDomainWindow[obj].src)) {              
                        var ret = myDomainWindow[obj].src;
                    } 
                }        
                ret;
                '''
                omniture_url = sel.get_eval(js_code) #parse&process this however you want


            except Exception, e:
                print 'We ran into an error: %s' % (e,)


        self.assertEqual("expectedValue", observedValue)


    def tearDown(self):
        self.selenium.stop()

if __name__ == "__main__":
    unittest.main()

Why can't you just read suffix, or url for that matter? Is the image loaded in an iframe or in your page?

If it is loaded in your page, then this may be a dirty hack (substitute document.body for whatever element is considered):

var ac = document.body.appendChild;
var sources = [];

document.body.appendChild = function(child) {
    if (/^img$/i.test(child.tagName)) {
        sources.push(child.getAttribute('src'));
    }
    ac(child);
}

继续阅读：analytics http-headers python selenium

Using Python/Selenium/Best Tool For The Job to get URI of image requests generated through JavaScript?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？