Is there any library that provides functions similar to twill and mechanize but is of better quality? [closed]

2023-02-20 03:29 Q&A author:

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance. Closed 11 years ago.

I am trying to write a test script for a web application. I tried twill, which turns out to use mechanize to parse HTML, but it really lets me down. For example, it fails to recognize that a form on the web page uses method "POST", not "GET".

So, are there any better alternatives, other than using urllib2 directly?

Edit: this is the form that twill cannot recognize.

<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML Transitional//EN'
'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'>
<html>
<head>
<meta http-equiv='Content-type' content='text/html; charset=utf-8' />
<title> Login  </title>
<link rel='stylesheet' href='/assets/styles/default.css' type='text/css'/>
<link rel='stylesheet' href='/assets/styles/button.css' type='text/css'/>

<link rel='stylesheet' href='/assets/styles/login.css' type='text/css'/>

<link rel='icon' type='image/x-icon' href='/assets/favicon.ico' /> 
</head>
<body >

<div id='content_area'>

<div id='outter_login_area'>
    <div id='inner_login_area'>
        <div id='login_head'>
            <div>
            Login with your
            </div>
            <div id='login_head_bottom'>
                <img src='/assets/images/header_logo.png' class='login_logo'/>
                <b>Admin Account</b>
            </div>
        </div>
        <hr/>
        <form action="." method="POST" id='user_login_form'>
            <div style='display:none'><input type='hidden' name='csrfmiddlewaretoken' value='c6b6e0ca08d53093428c61f62f51ea1f' /></div>

            <div>
                <label for="id_username">User Name</label>
                <input id="id_username" type="text" name="username" maxlength="30" />

            </div>
            <div>
                <label for="id_password">Password</label>
                <input type="password" name="password" id="id_password" />

            </div>
            <div id='submit_bar'>
                <input name='submit_button' type="submit" value="Submit" class="button blue"/>
            </div>
        </form>
    </div>
</div>

</div>
</body>
</html>

This is what twill says:

In [5]: br.go('http://localhost:8000/')
==> at http://localhost:8000/accounts/login/?next=/chancellor/

In [6]: br.get_all_forms() 
Out[6]: [<_mechanize_dist.ClientForm.HTMLForm instance at 0x03112F80>]

In [7]: br.get_all_forms()[0] 
Out[7]: <_mechanize_dist.ClientForm.HTMLForm instance at 0x03112F80>

In [8]: br.get_all_forms()[0].method 
Out[8]: 'GET'

Well, that's not true.

mechanize can recognize whether the method of a form is POST or GET just fine. It looks at the "method" attribute of the <form> tag to do so.

So, if it didn't work in your particular case, you'd have to find out what's wrong. Can you provide the HTML source code for the page you're trying to use? I suspect the form isn't declared as POST, otherwise mechanize would detect it.
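To check independently which method a page's forms declare, a minimal sketch along these lines may help. It uses Python 3's standard-library `html.parser`, not mechanize's own parser, and the names `FormMethodParser` and `form_methods` are made up for this illustration. A form with no "method" attribute is reported as GET, which is the HTML default.

```python
from html.parser import HTMLParser


class FormMethodParser(HTMLParser):
    """Collect the HTTP method declared by each <form> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.methods = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # A missing or empty method attribute falls back to GET.
            method = dict(attrs).get("method") or "get"
            self.methods.append(method.upper())


def form_methods(html):
    """Return the declared method of every form in the given HTML string."""
    parser = FormMethodParser()
    parser.feed(html)
    parser.close()
    return parser.methods


print(form_methods('<hr/><form action="." method="POST"></form>'))
print(form_methods('<form action="/search"></form>'))
```

If this reports POST for the page source while the browsing library reports GET, the problem most likely lies in how that library parses the surrounding markup rather than in the form declaration itself.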

That said, if you're looking for alternatives, I like to use scrapy for web scraping. It's a fast, high-level screen scraping and web crawling framework, written from the ground up for crawling websites and extracting structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

EDIT:

I saved your html snippet to /tmp/test.html and ran the following code:

import mechanize

br = mechanize.Browser()
br.open('file:///tmp/test.html')
br.select_form(nr=0)  # select the first form on the page
print(br.method)      # attribute access is passed through to the selected form

I get POST as the result, so I can't reproduce your issue.

Are you sure it's that page you're parsing?

EDIT 2:

Your HTML is broken. The line immediately above <form> is malformed: it contains <hr/>, and it should contain either <hr> or <hr /> in order to be parsed correctly.

Here's how to fix it when parsing:

import mechanize

br = mechanize.Browser()
response = br.open('file:///tmp/test.html')

# Fix the page so it is correctly parsed: rewrite <hr/> as <hr />.
# Bytes literals work whether get_data() returns str (Python 2) or bytes (Python 3).
response.set_data(response.get_data().replace(b'<hr/>', b'<hr />'))
br.set_response(response)

br.select_form(nr=0)
print(br.method)
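The replacement above handles only <hr/>. If other tags on the page are written the same way, a slightly more general clean-up could look like the sketch below. It is an assumption on my part that other tightly closed tags such as <br/> would trip the parser in the same way, and `space_self_closing_tags` is a hypothetical helper name. It deliberately touches only tags without attributes and works on raw bytes (Python 3).

```python
import re

# Matches an attribute-free self-closing tag written without a space,
# for example <hr/> or <br/>. Tags with attributes are left untouched.
_TIGHT_SELF_CLOSING = re.compile(rb"<([A-Za-z][A-Za-z0-9]*)/>")


def space_self_closing_tags(data):
    """Rewrite <tag/> as <tag /> in raw page bytes (hypothetical helper)."""
    return _TIGHT_SELF_CLOSING.sub(rb"<\1 />", data)


page = b"<hr/>\n<form action='.' method='POST'></form>"
print(space_self_closing_tags(page))
```

With mechanize, the cleaned bytes would be passed through the same two calls as above: response.set_data(space_self_closing_tags(response.get_data())) followed by br.set_response(response).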
