Extracting Field Names of an HTML Form - Python

Assume there is a page at "http://www.someHTMLPageWithTwoForms.com" that contains two forms (say Form 1 and Form 2). I have code like this:

import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer
h = httplib2.Http('.cache')
response, content = h.request('http://www.someHTMLPageWithTwoForms.com')
for field in BeautifulSoup(content, parseOnlyThese=SoupStrainer('input')):
        if field.has_key('name'):
                print field['name']

This returns all the field names from both Form 1 and Form 2 of the page. Is there any way to get only the field names that belong to a particular form (say, Form 2 only)?


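If there are only two forms, you can try this:

from BeautifulSoup import BeautifulSoup

forms = BeautifulSoup(content).findAll('form')
# forms[1] is the second form; search it for <input> tags instead of
# iterating over its children, which would also yield text nodes
for field in forms[1].findAll('input'):
    if field.has_key('name'):
        print field['name']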

If you need something more specific than "the second form", select the form by an attribute such as its id or class:

from BeautifulSoup import BeautifulSoup

# restrict the search to <form> tags carrying the given id
forms = BeautifulSoup(content).findAll('form', attrs={'id' : 'yourFormId'})
for field in forms[0].findAll('input'):
    if field.has_key('name'):
        print field['name']


This kind of parsing is also quite easy with lxml (which I personally prefer over BeautifulSoup because of its XPath support). For example, the following snippet prints the names of all fields (those that have one) that are direct children of forms named "form2":

# you can ignore this part, it's only here for the demo
from StringIO import StringIO
HTML = StringIO("""
<html>
<body>
    <form name="form1" action="/foo">
        <input name="uselessInput" type="text" />
    </form>
    <form name="form2" action="/bar">
        <input name="firstInput" type="text" />
        <input name="secondInput" type="text" />
    </form>
</body>
</html>
""")

# here goes the useful code
import lxml.html
tree = lxml.html.parse(HTML) # you can pass parse() a file-like object or a URL
root = tree.getroot()
for form in root.xpath('//form[@name="form2"]'):
    for field in form.getchildren():
        if 'name' in field.keys():
            print field.get('name')


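If you have the lxml and cssselect Python packages installed, the following function takes the HTML of a form as a string and returns a dict mapping each named input to its value: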

from lxml import html
def parse_form(form):
    tree = html.fromstring(form)
    data = {}
    for e in tree.cssselect('form input'):
        if e.get('name'):
            data[e.get('name')] = e.get('value')
    return data


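If you know the attribute name and value, you can search on them directly. Be careful when the attribute is itself called "name": the first parameter of findAll is also called name and means the tag name, so a name= keyword argument looks for tags named "Alice" and finds nothing. Pass the attribute through attrs instead:

from BeautifulSoup import BeautifulStoneSoup
xml = '<person name="Bob"><parent rel="mother" name="Alice">'
xmlSoup = BeautifulStoneSoup(xml)

xmlSoup.findAll(name="Alice")
# []

# search on the attribute rather than the tag name;
# this matches the <parent> tag whose name attribute is "Alice"
xmlSoup.findAll(attrs={"name" : "Alice"})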